Data and Configuration Arguments#
Configuration Arguments for Data Loading#
When using configuration files, remember that all parameter names must be specified in uppercase per yacs
convention.
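As an illustration of the uppercase convention, the sketch below checks a configuration for keys that are not fully uppercase. It uses a plain nested dictionary as a stand-in for a yacs configuration node, and the section and key names (DATASET, ATLAS, CROSS_VALIDATION, and so on) are assumptions made for this example rather than the tutorial's actual schema.

```python
def find_non_uppercase_keys(config, prefix=""):
    """Return dotted paths of all keys that are not fully uppercase."""
    bad = []
    for key, value in config.items():
        path = f"{prefix}.{key}" if prefix else key
        if key != key.upper():
            bad.append(path)
        # Recurse into nested sections.
        if isinstance(value, dict):
            bad.extend(find_non_uppercase_keys(value, path))
    return bad


# Hypothetical configuration; section and key names are illustrative only.
config = {
    "DATASET": {"ATLAS": "cc200", "FC": "tangent-pearson", "top_k_sites": None},
    "CROSS_VALIDATION": {"SPLIT": "skf", "NUM_FOLDS": 10},
}

print(find_non_uppercase_keys(config))  # ['DATASET.top_k_sites']
```

A check like this can be run before training to catch a lowercase key such as top_k_sites early.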
The main arguments we focus on are:
data_dir
: Local directory to store and load the dataset. If files are missing, they will be downloaded automatically.
Default: current working directory + /data
atlas
: The name of the brain atlas used to extract ROI time series. This corresponds to a subfolder inside fc/.
Available options:
"aal": AAL Atlas
"cc200": Craddock 200 ROI Atlas
"cc400": Craddock 400 ROI Atlas
"difumo64": DiFuMo 64 components
"dos160": Dosenbach 160 Atlas
"hcp-ica": HCP ICA-based components
"ho": Harvard-Oxford Atlas
"tt": Talairach-Tournoux
Default: "cc200"
fc
: The type of functional connectivity embedding to load (file name without extension).
Available options:
"pearson": Pearson correlation
"partial": Partial correlation
"tangent": Tangent embedding
"precision": Precision (inverse covariance)
"covariance": Sample covariance
"tangent-pearson": Hybrid of tangent embedding and Pearson correlation
Default: "tangent-pearson"
top_k_sites
: Optionally restrict the dataset to the top K sites (by number of subjects). If None, all sites are included.
Default: None
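To make the effect of top_k_sites concrete, here is a minimal sketch of selecting the K largest sites by subject count. It is not the library's implementation: the function name select_top_k_sites and the toy site list are hypothetical, and ties between equally sized sites are broken by first appearance, which is an assumption of this example.

```python
from collections import Counter


def select_top_k_sites(site_ids, top_k_sites=None):
    """Return indices of subjects that belong to the K largest sites."""
    if top_k_sites is None:
        # No restriction: keep every subject.
        return list(range(len(site_ids)))
    counts = Counter(site_ids)
    keep = {site for site, _ in counts.most_common(top_k_sites)}
    return [i for i, site in enumerate(site_ids) if site in keep]


# Toy data: three subjects at NYU, two at UM_1, one at PITT.
site_ids = ["NYU", "NYU", "NYU", "UM_1", "UM_1", "PITT"]

print(select_top_k_sites(site_ids, top_k_sites=2))  # [0, 1, 2, 3, 4]
print(select_top_k_sites(site_ids))                 # [0, 1, 2, 3, 4, 5]
```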
The data loader returns four values:
fc_data (np.ndarray): Functional connectivity data (vectorized if vectorize=True).
phenotypes (pd.DataFrame): Associated phenotypic information (e.g., site, age, gender).
rois (np.ndarray): ROI labels associated with the selected atlas.
coords (np.ndarray): ROI coordinates for visualization purposes.
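One common way to vectorize a symmetric connectivity matrix is to keep only its upper triangle, since the lower triangle carries the same values. The sketch below shows that idea in plain Python. Whether the loader keeps or drops the diagonal is not stated here, so the include_diagonal switch and the function name are assumptions of this example.

```python
def vectorize_upper_triangle(matrix, include_diagonal=False):
    """Flatten the upper triangle of a square matrix row by row."""
    n = len(matrix)
    offset = 0 if include_diagonal else 1
    return [matrix[i][j] for i in range(n) for j in range(i + offset, n)]


# Toy 3-ROI correlation matrix (symmetric, unit diagonal).
fc = [
    [1.0, 0.2, 0.5],
    [0.2, 1.0, 0.3],
    [0.5, 0.3, 1.0],
]

print(vectorize_upper_triangle(fc))                              # [0.2, 0.5, 0.3]
print(len(vectorize_upper_triangle(fc, include_diagonal=True)))  # 6
```

For n ROIs this yields n(n-1)/2 features without the diagonal and n(n+1)/2 with it.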
Categorical Variables from Phenotypic Data#
The following categorical phenotypes are included and will be one-hot encoded for modelling:
SITE_ID
SEX
HANDEDNESS_CATEGORY
EYE_STATUS_AT_SCAN
These variables are first mapped to descriptive labels using the provided MAPPING dictionary:
SEX: {1 → MALE, 2 → FEMALE}
HANDEDNESS_CATEGORY: Includes various representations unified into:
"RIGHT" (including missing values and -9999)
"LEFT"
"AMBIDEXTROUS" (e.g., "Mixed", "L->R", "Ambi")
EYE_STATUS_AT_SCAN: {1 → OPEN, 2 → CLOSED}
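The mapping described above can be sketched as follows. The dictionaries here are hypothetical stand-ins, not the contents of the provided MAPPING dictionary. In particular, the raw spellings accepted for right- and left-handedness ("R", "RIGHT", "L", "LEFT") are assumptions, and unknown values raise an error rather than being guessed.

```python
import math

# Hypothetical lookup tables modelled on the description above.
SEX_LABELS = {1: "MALE", 2: "FEMALE"}
EYE_STATUS_LABELS = {1: "OPEN", 2: "CLOSED"}
RIGHT_ALIASES = {"R", "RIGHT"}                   # assumed raw spellings
LEFT_ALIASES = {"L", "LEFT"}                     # assumed raw spellings
AMBIDEXTROUS_ALIASES = {"MIXED", "L->R", "AMBI"}


def unify_handedness(value):
    """Map a raw handedness entry to RIGHT, LEFT, or AMBIDEXTROUS."""
    # Missing entries (None, NaN, or the -9999 sentinel) default to RIGHT.
    if value is None or value == -9999:
        return "RIGHT"
    if isinstance(value, float) and math.isnan(value):
        return "RIGHT"
    text = str(value).strip().upper()
    if text in RIGHT_ALIASES:
        return "RIGHT"
    if text in LEFT_ALIASES:
        return "LEFT"
    if text in AMBIDEXTROUS_ALIASES:
        return "AMBIDEXTROUS"
    raise ValueError(f"Unrecognized handedness value: {value!r}")


print(SEX_LABELS[2])                   # FEMALE
print(unify_handedness("Ambi"))        # AMBIDEXTROUS
print(unify_handedness(float("nan")))  # RIGHT
```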
Continuous Variables#
The following continuous phenotypes will be optionally standardized:
AGE_AT_SCAN
FIQ
The available standardization options are described under the standardize argument of preprocess_phenotypic_data below.
Handling Missing Values#
Missing values are handled with the following assumptions and imputation strategies:
HANDEDNESS_CATEGORY: Missing entries (-9999 or NaN) are imputed as "RIGHT".
FIQ: Missing scores (-9999) are imputed with a default value of 100.
These choices ensure that the downstream models can operate without interruption while maintaining reasonable assumptions based on domain knowledge.
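A minimal sketch of the FIQ imputation rule is shown below. The function name and constants are hypothetical. Treating None and NaN as missing, in addition to the documented -9999 sentinel, is an extra assumption of this example.

```python
import math

MISSING_SENTINEL = -9999  # documented marker for a missing FIQ score
DEFAULT_FIQ = 100         # documented default used for imputation


def impute_fiq(scores):
    """Replace missing FIQ scores with the default value."""
    imputed = []
    for score in scores:
        is_nan = isinstance(score, float) and math.isnan(score)
        # None and NaN are treated as missing here by assumption.
        if score is None or is_nan or score == MISSING_SENTINEL:
            imputed.append(DEFAULT_FIQ)
        else:
            imputed.append(score)
    return imputed


print(impute_fiq([112, -9999, 95, float("nan")]))  # [112, 100, 95, 100]
```

A value of 100 matches the nominal population mean of standardized IQ scales, so an imputed subject sits at the center of the scale rather than at an extreme.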
Target Variable Encoding#
The diagnostic group DX_GROUP is used to define the target label for classification:
CONTROL → 0
ASD → 1
This binary label is suitable for supervised learning tasks focused on ASD detection.
The preprocess_phenotypic_data function performs the mapping, imputation, and label encoding described above.
The main arguments for preprocess_phenotypic_data are:
data
: A DataFrame containing the phenotypic information for each subject. Must include all selected phenotypes, such as SEX, AGE_AT_SCAN, FIQ, HANDEDNESS_CATEGORY, EYE_STATUS_AT_SCAN, and DX_GROUP.
Type: pd.DataFrame of shape (n_subjects, n_phenotypes)
Required
standardize
: Whether to standardize continuous variables (AGE_AT_SCAN and FIQ). This helps remove scale-related bias before modeling.
Available options:
False: No standardization (raw values retained)
True or "all": Standardize across all subjects
"site": Standardize within each acquisition site
Default: False
one_hot_encode
: Whether to one-hot encode categorical variables (SITE_ID, SEX, HANDEDNESS_CATEGORY, EYE_STATUS_AT_SCAN). This is typically used when feeding the data into machine learning models.
Type: bool
Default: True
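The difference between the three standardize options can be sketched with z-scoring in plain Python. This is an illustration under stated assumptions, not the function's actual code: the helper names are hypothetical, and the population standard deviation is used, which may differ from what the library does.

```python
from collections import defaultdict
from statistics import fmean, pstdev


def zscore(values):
    """Center to zero mean and scale to unit (population) standard deviation."""
    mean = fmean(values)
    std = pstdev(values)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


def standardize(values, sites, mode=False):
    """Apply no scaling, global scaling, or per-site scaling."""
    if mode is False:
        return list(values)
    if mode is True or mode == "all":
        return zscore(values)
    if mode == "site":
        # Group subject indices by site, then z-score each group separately.
        by_site = defaultdict(list)
        for index, site in enumerate(sites):
            by_site[site].append(index)
        result = [0.0] * len(values)
        for indices in by_site.values():
            scaled = zscore([values[i] for i in indices])
            for i, z in zip(indices, scaled):
                result[i] = z
        return result
    raise ValueError(f"Unknown standardize option: {mode!r}")


ages = [10.0, 20.0, 30.0, 40.0]
sites = ["A", "A", "B", "B"]

print(standardize(ages, sites, "site"))  # [-1.0, 1.0, -1.0, 1.0]
print(standardize(ages, sites, "all"))
```

With "site", the age gap between sites A and B disappears because each site is centered on its own mean, while "all" preserves it.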
The function returns the following:
labels (array-like): The encoded diagnostic labels derived from DX_GROUP.
0: CONTROL
1: ASD
Shape: (n_subjects,)
sites (array-like): Site identifiers corresponding to each subject, useful for site-wise stratification or harmonization.
Shape: (n_subjects,)
phenotypes (pd.DataFrame): The cleaned and processed phenotype DataFrame with missing values imputed, categorical variables mapped (and optionally one-hot encoded), and continuous variables optionally standardized.
Shape: (n_subjects, n_selected_phenotypes)
Note: The selected phenotypes include:
SITE_ID
SEX
AGE_AT_SCAN
FIQ
HANDEDNESS_CATEGORY
EYE_STATUS_AT_SCAN
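When one_hot_encode is enabled, each categorical phenotype expands into one indicator column per category, so the processed table can be wider than the six selected phenotypes. The sketch below shows the idea for a single variable. The function name and the prefix_category column naming are assumptions of this example.

```python
def one_hot_encode(values, prefix):
    """Return (column_names, rows) with one indicator column per category."""
    categories = sorted(set(values))
    columns = [f"{prefix}_{category}" for category in categories]
    rows = [
        [1 if value == category else 0 for category in categories]
        for value in values
    ]
    return columns, rows


sex = ["MALE", "FEMALE", "MALE"]
columns, rows = one_hot_encode(sex, prefix="SEX")

print(columns)  # ['SEX_FEMALE', 'SEX_MALE']
print(rows)     # [[0, 1], [1, 0], [0, 1]]
```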
Configuration Arguments for Cross-Validation#
In this tutorial, we specify the following arguments for cross-validation:
split
: Defines the cross-validation strategy.
Available options:
"skf": Stratified K-fold to maintain label balance in each fold.
"lpgo": Leave p-groups out to evaluate generalization across sites by holding out entire groups (e.g., imaging sites).
Default: "skf"
num_folds
: The number of folds for "skf" or the number of groups to leave out in "lpgo".
Default: 10
num_repeats
: The number of times the k-fold procedure is repeated to obtain more stable estimates (ignored with "lpgo").
Default: 5
random_state
: Seed for random number generators for reproducibility.
Default: None
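To show what leave-p-groups-out does, here is a small standard-library sketch that enumerates every way of holding out p sites. It is only an illustration with a hypothetical function name and toy site labels. scikit-learn offers the same strategy as sklearn.model_selection.LeavePGroupsOut.

```python
from itertools import combinations


def leave_p_groups_out(groups, p):
    """Yield (train_indices, test_indices) for every choice of p held-out groups."""
    unique_groups = sorted(set(groups))
    for held_out in combinations(unique_groups, p):
        held_out = set(held_out)
        test = [i for i, g in enumerate(groups) if g in held_out]
        train = [i for i, g in enumerate(groups) if g not in held_out]
        yield train, test


# Toy data: five subjects across three sites.
sites = ["NYU", "NYU", "PITT", "UM_1", "UM_1"]
splits = list(leave_p_groups_out(sites, p=1))

print(len(splits))  # 3
print(splits[0])    # ([2, 3, 4], [0, 1])
```

The number of splits is the binomial coefficient C(number of sites, p), so it grows quickly as p increases.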
Hyperparameter Grid#
We also specify the hyperparameter search strategy and other training parameters for each configuration, including:
classifier
: The base model used for classification.
Available options:
"lda": Linear Discriminant Analysis
"lr": Logistic Regression
"linear_svm": Linear Support Vector Machine
"svm": Kernel Support Vector Machine
"ridge": Ridge Classifier (L2-regularized linear model)
"auto": Automatically selects an appropriate model based on data characteristics.
Default: "lr"
param_grid
: The hyperparameter grid used for both the classifier and the MIDA domain adapter.
To specify MIDA's parameters, each key in the grid must be prefixed with domain_adapter__ (e.g., domain_adapter__mu).
For classifier parameters, no prefix is needed.
If param_grid is set to None, PyKale will use its default grid, which spans a broad hyperparameter search space. While this may maximize performance, it significantly increases training time. Therefore, it is not recommended to use param_grid=None in combination with search_strategy='grid'.
Default: None
nonlinear
: Whether to apply non-linear transformations (non-interpretable).
Type: bool
Default: False
search_strategy
: The hyperparameter search method.
Available options:
"random": Randomly search over finite iterations.
"grid": Search over all possible combinations.
Default: "random"
num_search_iterations
: The number of hyperparameter combinations to evaluate in randomized search.
Default: 1,000
num_solver_iterations
: The maximum number of iterations allowed for solver convergence.
Default: 1,000,000
scoring
: A list of performance metrics used during cross-validation.
Available options:
"accuracy": Accuracy
"precision": Precision
"recall": Recall
"f1": F1-Score
"roc_auc": Area Under ROC Curve (AUROC)
"matthews_corrcoef": Matthews Correlation Coefficient (MCC)
Default: ["accuracy", "roc_auc"]
refit
: The metric used to refit the best model after hyperparameter tuning.
Default: "accuracy"
num_jobs
: The number of CPU cores used for training and hyperparameter search.
Set to k to use k CPU cores, -1 for all CPU cores, -k for all but k CPU cores.
Default: 1
pre_dispatch
: Controls job pre-dispatching for parallel execution.
Default: "2*n_jobs"
verbose
: Controls verbosity of training output.
Default: 0
random_state
: Seed for random number generators for reproducibility.
Default: None
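The param_grid prefix convention and the cost of an exhaustive search can be sketched as follows. The grid values are illustrative, the classifier key "C" is an assumed parameter name, and the helper functions are hypothetical. Only the domain_adapter__ prefix and the mu parameter come from the description above.

```python
from itertools import product

# Illustrative grid: one MIDA parameter (prefixed) and one classifier parameter.
param_grid = {
    "domain_adapter__mu": [0.1, 1.0, 10.0],
    "C": [0.01, 0.1, 1.0, 10.0],  # assumed classifier parameter name
}


def expand_grid(grid):
    """List every combination that an exhaustive grid search would evaluate."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]


def split_by_target(params, prefix="domain_adapter__"):
    """Separate domain adapter parameters from classifier parameters."""
    adapter = {k[len(prefix):]: v for k, v in params.items() if k.startswith(prefix)}
    classifier = {k: v for k, v in params.items() if not k.startswith(prefix)}
    return adapter, classifier


candidates = expand_grid(param_grid)

print(len(candidates))                 # 12
print(split_by_target(candidates[0]))  # ({'mu': 0.1}, {'C': 0.01})
```

Assuming each candidate is fitted once per cross-validation split, the default 10 folds with 5 repeats turn these 12 candidates into 600 fits. A random search caps the number of candidates at num_search_iterations regardless of grid size.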