Data and Configuration Arguments
Configuration Arguments for Data Loading
When using configuration files, remember that all parameter names must be specified in uppercase, per the yacs convention.
The main arguments for data loading are:
data_dir: Local directory in which to store and load the dataset. If files are missing, they will be downloaded automatically.
  Default: current working directory + /data
atlas: The name of the brain atlas used to extract ROI time series. This corresponds to a subfolder inside fc/.
  Available options:
    "aal": AAL Atlas
    "cc200": Craddock 200 ROI Atlas
    "cc400": Craddock 400 ROI Atlas
    "difumo64": DiFuMo 64 components
    "dos160": Dosenbach 160 Atlas
    "hcp-ica": HCP ICA-based components
    "ho": Harvard-Oxford Atlas
    "tt": Talairach-Tournoux
  Default: "cc200"
fc: The type of functional connectivity embedding to load (file name without extension).
  Available options:
    "pearson": Pearson correlation
    "partial": Partial correlation
    "tangent": Tangent embedding
    "precision": Precision (inverse covariance)
    "covariance": Sample covariance
    "tangent-pearson": Hybrid of tangent embedding and Pearson correlation
  Default: "tangent-pearson"
top_k_sites: Optionally restrict the dataset to the top K sites (by number of subjects). If None, all sites are included.
  Default: None
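As a rough illustration of these arguments, the sketch below collects them in a plain Python dict with uppercase keys, standing in for a yacs configuration node. The dict layout and the validate_data_config helper are hypothetical and not part of PyKale; only the option names and defaults come from the list above.

```python
import os

# Allowed values, copied from the option lists above.
ATLASES = {"aal", "cc200", "cc400", "difumo64", "dos160", "hcp-ica", "ho", "tt"}
FC_TYPES = {"pearson", "partial", "tangent", "precision", "covariance", "tangent-pearson"}

# Documented defaults; keys are uppercase per the yacs convention.
# NOTE: a plain dict is used here as a stand-in for a yacs config node.
data_config = {
    "DATA_DIR": os.path.join(os.getcwd(), "data"),
    "ATLAS": "cc200",
    "FC": "tangent-pearson",
    "TOP_K_SITES": None,
}


def validate_data_config(cfg):
    """Hypothetical helper: check a config dict against the documented options."""
    for key in cfg:
        if key != key.upper():
            raise ValueError(f"Parameter name {key!r} must be uppercase")
    if cfg["ATLAS"] not in ATLASES:
        raise ValueError(f"Unknown atlas: {cfg['ATLAS']!r}")
    if cfg["FC"] not in FC_TYPES:
        raise ValueError(f"Unknown FC type: {cfg['FC']!r}")
    top_k = cfg["TOP_K_SITES"]
    if top_k is not None and (not isinstance(top_k, int) or top_k < 1):
        raise ValueError("TOP_K_SITES must be None or a positive integer")
    return cfg


validate_data_config(data_config)
```

Checking values up front like this surfaces a mistyped atlas or embedding name before any download or file lookup is attempted.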
The data-loading function returns four values:
fc_data (np.ndarray): Functional connectivity data (vectorized if vectorize=True).
phenotypes (pd.DataFrame): Associated phenotypic information (e.g., site, age, gender).
rois (np.ndarray): ROI labels associated with the selected atlas.
coords (np.ndarray): ROI coordinates for visualization purposes.
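The exact layout produced by vectorize=True is not specified here. A common way to vectorize a symmetric connectivity matrix is to keep only its upper triangle; the NumPy sketch below shows that approach as an assumption, and excluding the diagonal is a further assumption. The data are random toy matrices.

```python
import numpy as np

rng = np.random.default_rng(0)
n_subjects, n_rois = 4, 5

# Build symmetric toy connectivity matrices, one per subject.
a = rng.standard_normal((n_subjects, n_rois, n_rois))
fc_matrices = (a + a.transpose(0, 2, 1)) / 2

# Assumed vectorization: keep the strictly upper-triangular entries
# (k=1 skips the diagonal).
rows, cols = np.triu_indices(n_rois, k=1)
fc_vectors = fc_matrices[:, rows, cols]

print(fc_vectors.shape)  # (4, 10): n_rois * (n_rois - 1) / 2 features per subject
```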
Categorical Variables from Phenotypic Data
The following categorical phenotypes are included and will be one-hot encoded for modelling:
SITE_ID
SEX
HANDEDNESS_CATEGORY
EYE_STATUS_AT_SCAN
These variables are first mapped to descriptive labels using the provided MAPPING dictionary:
SEX: {1 → MALE, 2 → FEMALE}
HANDEDNESS_CATEGORY: Includes various representations unified into:
  "RIGHT" (including missing values and -9999)
  "LEFT"
  "AMBIDEXTROUS" (e.g., "Mixed", "L->R", "Ambi")
EYE_STATUS_AT_SCAN: {1 → OPEN, 2 → CLOSED}
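The mapping step can be sketched with ordinary dictionaries. The names below (SEX_LABELS, EYE_STATUS_LABELS, AMBIDEXTROUS_VARIANTS, map_to_labels) are hypothetical and do not reproduce the structure of PyKale's MAPPING dictionary; only the code-to-label pairs come from the text above.

```python
SEX_LABELS = {1: "MALE", 2: "FEMALE"}
EYE_STATUS_LABELS = {1: "OPEN", 2: "CLOSED"}
# Only the variants named in the text; the real mapping may cover more spellings.
AMBIDEXTROUS_VARIANTS = {"Mixed", "L->R", "Ambi"}


def map_to_labels(record):
    """Hypothetical helper: return a copy of `record` with codes replaced by labels."""
    mapped = dict(record)
    mapped["SEX"] = SEX_LABELS[record["SEX"]]
    mapped["EYE_STATUS_AT_SCAN"] = EYE_STATUS_LABELS[record["EYE_STATUS_AT_SCAN"]]
    if record["HANDEDNESS_CATEGORY"] in AMBIDEXTROUS_VARIANTS:
        mapped["HANDEDNESS_CATEGORY"] = "AMBIDEXTROUS"
    return mapped


subject = {"SEX": 2, "EYE_STATUS_AT_SCAN": 1, "HANDEDNESS_CATEGORY": "Mixed"}
print(map_to_labels(subject))
# {'SEX': 'FEMALE', 'EYE_STATUS_AT_SCAN': 'OPEN', 'HANDEDNESS_CATEGORY': 'AMBIDEXTROUS'}
```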
Continuous Variables
The following continuous phenotypes can optionally be standardized:
AGE_AT_SCAN
FIQ
The available options for standardizing these phenotypes are explained in more detail below.
Handling Missing Values
Missing values are handled with the following assumptions and imputation strategies:
HANDEDNESS_CATEGORY: Missing entries (-9999 or NaN) are imputed as "RIGHT".
FIQ: Missing scores (-9999) are imputed with a default value of 100.
These choices ensure that the downstream models can operate without interruption while maintaining reasonable assumptions based on domain knowledge.
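A minimal sketch of these two imputation rules on a single subject record follows. The helper names are hypothetical, and treating None like NaN is an assumption of this sketch rather than documented behaviour.

```python
import math


def handedness_is_missing(value):
    """Missing handedness is coded as -9999 (number or string) or NaN."""
    if value is None:  # assumption: None is treated like NaN
        return True
    if isinstance(value, str):
        return value.strip() == "-9999"
    return math.isnan(value) or value == -9999


def impute(record, default_fiq=100):
    """Return a copy of `record` with the documented defaults filled in."""
    imputed = dict(record)
    if handedness_is_missing(record["HANDEDNESS_CATEGORY"]):
        imputed["HANDEDNESS_CATEGORY"] = "RIGHT"
    if record["FIQ"] == -9999:
        imputed["FIQ"] = default_fiq
    return imputed


print(impute({"HANDEDNESS_CATEGORY": float("nan"), "FIQ": -9999}))
# {'HANDEDNESS_CATEGORY': 'RIGHT', 'FIQ': 100}
```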
Target Variable Encoding
The diagnostic group DX_GROUP is used to define the target label for classification:
CONTROL → 0
ASD → 1
This binary label is suitable for supervised learning tasks focused on ASD detection.
The preprocess_phenotypic_data function performs the mapping, imputation, and encoding steps described above.
The main arguments for preprocess_phenotypic_data include:
data: A DataFrame containing the phenotypic information for each subject. Must include all selected phenotypes, such as SEX, AGE_AT_SCAN, FIQ, HANDEDNESS_CATEGORY, EYE_STATUS_AT_SCAN, and DX_GROUP.
  Type: pd.DataFrame of shape (n_subjects, n_phenotypes)
  Required
standardize: Whether to standardize continuous variables (AGE_AT_SCAN and FIQ). This helps remove scale-related bias before modeling.
  Available options:
    False: No standardization (raw values retained)
    True or "all": Standardize across all subjects
    "site": Standardize within each acquisition site
  Default: False
one_hot_encode: Whether to one-hot encode categorical variables (SITE_ID, SEX, HANDEDNESS_CATEGORY, EYE_STATUS_AT_SCAN). This is typically used when feeding the data into machine learning models.
  Type: bool
  Default: True
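The difference between standardizing across all subjects and within each site can be sketched with pandas. This is not PyKale's implementation: the zscore helper is hypothetical, the data are toy values, and the use of the population standard deviation (ddof=0) is an assumption.

```python
import pandas as pd

# Toy phenotypes for two acquisition sites.
phenotypes = pd.DataFrame(
    {
        "SITE_ID": ["NYU", "NYU", "NYU", "UCLA", "UCLA", "UCLA"],
        "AGE_AT_SCAN": [10.0, 12.0, 14.0, 20.0, 25.0, 30.0],
        "FIQ": [100.0, 110.0, 120.0, 90.0, 100.0, 110.0],
    }
)
continuous = ["AGE_AT_SCAN", "FIQ"]


def zscore(x):
    # ddof=0 (population standard deviation) is an assumption of this sketch.
    return (x - x.mean()) / x.std(ddof=0)


# standardize=True or "all": one mean and standard deviation over all subjects.
standardized_all = phenotypes[continuous].apply(zscore)

# standardize="site": a separate mean and standard deviation per acquisition site.
standardized_site = phenotypes.groupby("SITE_ID")[continuous].transform(zscore)

print(standardized_all.round(2))
print(standardized_site.round(2))
```

With the "all" option, subjects from the younger site keep negative age scores; with the "site" option, each site is centred on zero, which removes between-site offsets in these variables.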
The function returns the following:
labels (array-like): The encoded diagnostic labels derived from DX_GROUP.
  0: CONTROL
  1: ASD
  Shape: (n_subjects,)
sites (array-like): Site identifiers corresponding to each subject, useful for site-wise stratification or harmonization.
  Shape: (n_subjects,)
phenotypes (pd.DataFrame): The cleaned and processed phenotype DataFrame with missing values imputed, categorical variables mapped (and optionally one-hot encoded), and continuous variables optionally standardized.
  Shape: (n_subjects, n_selected_phenotypes)
  Note: The selected phenotypes include:
    SITE_ID
    SEX
    AGE_AT_SCAN
    FIQ
    HANDEDNESS_CATEGORY
    EYE_STATUS_AT_SCAN
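When one-hot encoding is enabled, each categorical phenotype expands into one column per category, so the returned phenotypes frame has more columns than the six selected phenotypes. The sketch below uses pd.get_dummies as a stand-in encoder on toy data; whether PyKale uses this function internally is not stated here.

```python
import pandas as pd

# Toy, already-mapped phenotypes for three subjects.
phenotypes = pd.DataFrame(
    {
        "SITE_ID": ["NYU", "UCLA", "NYU"],
        "SEX": ["MALE", "FEMALE", "MALE"],
        "AGE_AT_SCAN": [10.0, 20.0, 14.0],
        "FIQ": [100.0, 90.0, 120.0],
        "HANDEDNESS_CATEGORY": ["RIGHT", "LEFT", "RIGHT"],
        "EYE_STATUS_AT_SCAN": ["OPEN", "OPEN", "CLOSED"],
    }
)
categorical = ["SITE_ID", "SEX", "HANDEDNESS_CATEGORY", "EYE_STATUS_AT_SCAN"]

# Each categorical column becomes one indicator column per observed category.
encoded = pd.get_dummies(phenotypes, columns=categorical)

print(phenotypes.shape, "->", encoded.shape)  # (3, 6) -> (3, 10)
```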
Configuration Arguments for Cross-Validation
In this tutorial, we specify the following arguments for cross-validation:
split: Defines the cross-validation strategy.
  Available options:
    "skf": Stratified K-fold to maintain label balance in each fold.
    "lpgo": Leave p-groups out to evaluate generalization across sites by holding out entire groups (e.g., imaging sites).
  Default: "skf"
num_folds: The number of folds for "skf" or the number of groups to leave out in "lpgo".
  Default: 10
num_repeats: The number of times the k-fold procedure is repeated to obtain more stable estimates (ignored with "lpgo").
  Default: 5
random_state: Seed for random number generators for reproducibility.
  Default: None
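The two strategies correspond closely to splitters available in scikit-learn. The sketch below uses RepeatedStratifiedKFold and LeavePGroupsOut as stand-ins on toy data; how PyKale constructs its splitters internally is an assumption not confirmed here.

```python
import numpy as np
from sklearn.model_selection import LeavePGroupsOut, RepeatedStratifiedKFold

# Toy data: 40 subjects, balanced labels, 4 acquisition sites.
labels = np.tile([0, 1], 20)
sites = np.repeat(["A", "B", "C", "D"], 10)
features = np.zeros((40, 3))

# "skf" with num_folds=10 and num_repeats=5: 10 x 5 train/test splits.
skf = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=0)
n_skf_splits = skf.get_n_splits(features, labels)

# "lpgo" holding out 2 sites at a time: one split per pair of sites.
lpgo = LeavePGroupsOut(n_groups=2)
n_lpgo_splits = lpgo.get_n_splits(features, labels, groups=sites)

# Sites held out in the first leave-p-groups-out split.
train_idx, test_idx = next(lpgo.split(features, labels, groups=sites))
held_out = sorted({str(s) for s in sites[test_idx]})

print(n_skf_splits, n_lpgo_splits)  # 50 6
```

Because leave-p-groups-out enumerates every combination of held-out sites, the number of splits grows quickly with the number of sites, whereas repeated stratified k-fold always yields num_folds times num_repeats splits.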
Hyperparameter Grid
We also specify the hyperparameter search strategy and other training parameters for each configuration, including:
classifier: The base model used for classification.
  Available options:
    "lda": Linear Discriminant Analysis
    "lr": Logistic Regression
    "linear_svm": Linear Support Vector Machine
    "svm": Kernel Support Vector Machine
    "ridge": Ridge Classifier (L2-regularized linear model)
    "auto": Automatically selects an appropriate model based on data characteristics.
  Default: "lr"
param_grid: The hyperparameter grid used for both the classifier and the MIDA domain adapter. To specify MIDA’s parameters, each key in the grid must be prefixed with domain_adapter__ (e.g., domain_adapter__mu). For classifier parameters, no prefix is needed.
  If param_grid is set to None, PyKale will use its default grid, which spans a broad hyperparameter search space. While this may maximize performance, it significantly increases training time. Therefore, it is not recommended to use param_grid=None in combination with search_strategy='grid'.
  Default: None
nonlinear: Whether to apply non-linear transformations (non-interpretable).
  Type: boolean
  Default: False
search_strategy: The hyperparameter search method.
  Available options:
    "random": Randomly search over finite iterations.
    "grid": Search over all possible combinations.
  Default: "random"
num_search_iterations: The number of hyperparameter combinations to evaluate in randomized search.
  Default: 1,000
num_solver_iterations: The maximum number of iterations allowed for solver convergence.
  Default: 1,000,000
scoring: A list of performance metrics used during cross-validation.
  Available options:
    "accuracy": Accuracy
    "precision": Precision
    "recall": Recall
    "f1": F1-Score
    "roc_auc": Area Under ROC Curve (AUROC)
    "matthews_corrcoef": Matthews Correlation Coefficient (MCC)
  Default: ["accuracy", "roc_auc"]
refit: The metric used to refit the best model after hyperparameter tuning.
  Default: "accuracy"
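num_jobs: The number of CPU cores used for training and hyperparameter search. Set to k to use k CPU cores, or -1 for all CPU cores. Under the joblib convention that scikit-learn follows, other negative values -k use n_cpus + 1 - k cores (e.g., -2 uses all but one CPU core).
  Default: 1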
pre_dispatch: Controls job pre-dispatching for parallel execution.
  Default: "2*n_jobs"
verbose: Controls verbosity of training output.
  Default: 0
random_state: Seed for random number generators for reproducibility.
  Default: None
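To make the param_grid prefix rule concrete, the sketch below builds a small grid and counts its combinations. The values are illustrative only and are not PyKale defaults; the key "C" assumes a classifier that exposes a C parameter (such as logistic regression), and splitting the keys by prefix is shown only to illustrate the naming rule.

```python
from itertools import product

# Illustrative grid: values are examples only, not PyKale defaults.
param_grid = {
    "domain_adapter__mu": [0.1, 1.0, 10.0],  # MIDA parameter, prefixed
    "C": [0.01, 0.1, 1.0, 10.0],  # classifier parameter, no prefix
}

PREFIX = "domain_adapter__"

# Keys with the prefix belong to the domain adapter; the rest to the classifier.
adapter_params = {
    key[len(PREFIX):]: values
    for key, values in param_grid.items()
    if key.startswith(PREFIX)
}
classifier_params = {
    key: values for key, values in param_grid.items() if not key.startswith(PREFIX)
}

# A grid search evaluates every combination, whereas a random search
# evaluates a limited number set by num_search_iterations.
n_combinations = len(list(product(*param_grid.values())))
print(n_combinations)  # 12
```

The combination count multiplies with every added parameter, which is why an explicit, small grid or the "random" strategy keeps training time manageable.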