loki2.mil.models.src.dataset

Dataset classes for Multiple Instance Learning (MIL) cell analysis.

This module provides PyTorch Dataset classes for processing patient tile embeddings in a bag-of-instances format for MIL training and evaluation.

Module Contents

class loki2.mil.models.src.dataset.PatientTileBagDataset_v2(dataframe: pandas.DataFrame, bag_size: int = 300, random_seed: int = 27)

Bases: torch.utils.data.Dataset

Dataset for patient tile bagging with fixed bag size.

This dataset groups patient data by Patient_Bagging_ID and samples a fixed number of tiles (bag_size) from each group for training.

data: Input DataFrame containing patient embeddings and labels.

bag_size: Number of tiles to sample per bag. Defaults to 300.

rng: Random number generator for reproducible sampling.

patient_groups: Grouped DataFrame by Patient_Bagging_ID.

patient_ids: List of unique Patient_Bagging_ID values.

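The grouping and fixed-size sampling described above can be sketched roughly as follows. This is a minimal illustration, not the module's implementation: the class name FixedSizeBagSketch, the "emb_" column prefix, reading the label from Patient_Label, and the fallback to sampling with replacement when a group has fewer than bag_size tiles are all assumptions. NumPy arrays stand in for the tensors a torch.utils.data.Dataset would normally return.

```python
import numpy as np
import pandas as pd


class FixedSizeBagSketch:
    """Hypothetical stand-in for a fixed-bag-size MIL dataset (not the real class)."""

    def __init__(self, dataframe: pd.DataFrame, bag_size: int = 300, random_seed: int = 27):
        self.data = dataframe
        self.bag_size = bag_size
        self.rng = np.random.default_rng(random_seed)
        # One group per Patient_Bagging_ID value.
        self.patient_groups = dataframe.groupby("Patient_Bagging_ID", sort=False)
        self.patient_ids = list(self.patient_groups.groups.keys())
        # Assumption: embedding columns share an "emb_" prefix.
        self.feature_cols = [c for c in dataframe.columns if c.startswith("emb_")]

    def __len__(self) -> int:
        return len(self.patient_ids)

    def __getitem__(self, idx: int):
        group = self.patient_groups.get_group(self.patient_ids[idx])
        # Assumption: small groups are sampled with replacement to reach bag_size.
        replace = len(group) < self.bag_size
        rows = self.rng.choice(len(group), size=self.bag_size, replace=replace)
        bag = group[self.feature_cols].to_numpy(dtype=np.float32)[rows]
        label = np.float32(group["Patient_Label"].iloc[0])
        return bag, label


# Toy data: two bags with 5 and 3 tiles, two embedding dimensions.
toy = pd.DataFrame({
    "Patient_Bagging_ID": ["A_0"] * 5 + ["B_0"] * 3,
    "Patient_Label": [1] * 5 + [0] * 3,
    "emb_0": np.arange(8, dtype=float),
    "emb_1": np.arange(8, dtype=float) * 2,
})
dataset = FixedSizeBagSketch(toy, bag_size=4)
bag, label = dataset[0]
```

Every item has the same shape (bag_size, embedding_dim) regardless of how many tiles the group holds, which is what lets a dataloader batch the bags without padding.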

class loki2.mil.models.src.dataset.PatientTileBagDatasetForAttention_V7(dataframe: pandas.DataFrame, bag_size: int = 2000, random_seed: int = 42)

Bases: torch.utils.data.Dataset

Dataset for attention analysis with patient and bagging ID tracking.

This dataset groups data by both Patient_ID and Patient_Bagging_ID, allowing for attention analysis across multiple bags per patient.

data: Input DataFrame containing patient embeddings and labels.

bag_size: Number of tiles to sample per bag. Defaults to 2000.

rng: Random number generator for reproducible sampling.

group_cols: List of column names to group by.

patient_groups: Grouped DataFrame by Patient_ID and Patient_Bagging_ID.

keys: List of unique (Patient_ID, Patient_Bagging_ID) tuples.


group_cols = ['Patient_ID', 'Patient_Bagging_ID']

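Grouping by two columns yields one key per (Patient_ID, Patient_Bagging_ID) pair. The short sketch below shows how such keys arise with pandas; the toy DataFrame and its values are made up for illustration, and the real class may build its keys differently.

```python
import pandas as pd

group_cols = ["Patient_ID", "Patient_Bagging_ID"]

# Toy data: patient P1 has two bags, patient P2 has one.
toy = pd.DataFrame({
    "Patient_ID": ["P1", "P1", "P1", "P2"],
    "Patient_Bagging_ID": ["P1_0", "P1_0", "P1_1", "P2_0"],
    "cell_position": [0, 1, 2, 0],
})

patient_groups = toy.groupby(group_cols, sort=False)
keys = list(patient_groups.groups.keys())

# Each key identifies one bag; get_group returns that bag's rows.
first_bag = patient_groups.get_group(keys[0])
```

Keeping Patient_ID in the key makes it possible to trace each bag's attention scores back to the patient it came from.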

loki2.mil.models.src.dataset.dynamic_multiplication_factor(total_patches: int) → int

Calculate dynamic multiplication factor for bagging.

The multiplication factor increases as the number of patches decreases, ensuring adequate bagging for patients with fewer patches.

Parameters:

total_patches – Total number of patches for a patient.

Returns:

Multiplication factor between 2 and 5, rounded to the nearest integer.

Return type:

int
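The exact schedule is not documented here, so the following is only one possible shape for such a function: a linear interpolation from 5 down to 2, clamped at both ends. The name dynamic_factor_sketch and the thresholds low=500 and high=5000 are hypothetical and are not taken from the module.

```python
def dynamic_factor_sketch(total_patches: int, low: int = 500, high: int = 5000) -> int:
    """Hypothetical schedule: 5 at or below `low` patches, 2 at or above `high`."""
    # Fraction of the way from `low` to `high`, clamped to [0, 1].
    frac = min(max((total_patches - low) / (high - low), 0.0), 1.0)
    # Interpolate from 5 down to 2, then round to an integer.
    return round(5 - 3 * frac)


few_patches_factor = dynamic_factor_sketch(100)
many_patches_factor = dynamic_factor_sketch(10_000)
```

Under these assumed thresholds, patients with few patches get the largest factor and patients with many patches get the smallest.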

loki2.mil.models.src.dataset.create_bagging_dataframe_noreplacement_V12(df: pandas.DataFrame, BAG_SIZE: int, MIN_PATCHES_PER_WSI: int, MAX_PATCHES_PER_WSI: int, random_seed: int = 27) → pandas.DataFrame

Create bagging dataframe without replacement.

Groups patients and creates bags of tiles, ensuring no tile is reused within the same patient until all tiles have been used. The number of bags per patient is determined by a dynamic multiplication factor.

Parameters:

df – Input DataFrame with columns: Patient_ID, cell_position, and embedding features.
BAG_SIZE – Number of tiles per bag.
MIN_PATCHES_PER_WSI – Minimum number of patches to use per WSI.
MAX_PATCHES_PER_WSI – Maximum number of patches to use per WSI.
random_seed – Random seed for reproducibility. Defaults to 27.

Returns:

DataFrame with an added Patient_Bagging_ID column, containing the bagged tile data.

Return type:

pandas.DataFrame
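The "no reuse until all tiles have been used" rule can be illustrated with a shuffled pool that is refilled only once it is empty. This is a sketch under stated assumptions, not the function's actual code: draw_bags_sketch is a hypothetical helper, it works on tile indices rather than DataFrame rows, and the number of bags is passed in directly instead of being derived from the multiplication factor and the MIN/MAX patch limits.

```python
import numpy as np


def draw_bags_sketch(n_tiles: int, bag_size: int, n_bags: int, seed: int = 27):
    """Hypothetical helper: return n_bags index arrays of length bag_size."""
    rng = np.random.default_rng(seed)
    pool = rng.permutation(n_tiles)  # tiles not yet used in the current pass
    bags = []
    for _ in range(n_bags):
        picked = []
        while len(picked) < bag_size:
            if len(pool) == 0:
                # Every tile has been used once; start a fresh shuffled pass.
                pool = rng.permutation(n_tiles)
            need = bag_size - len(picked)
            picked.extend(pool[:need])
            pool = pool[need:]
        bags.append(np.array(picked))
    return bags


# 10 tiles, bags of 4: the first 10 draws cover every tile exactly once.
bags = draw_bags_sketch(n_tiles=10, bag_size=4, n_bags=5)
```

Note that a bag which straddles two passes may contain the same tile twice in this sketch; whether the real function prevents that is not stated in this reference.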

loki2.mil.models.src.dataset.create_bagging_dataframe_All_v7(df: pandas.DataFrame, random_seed: int = 27) → pandas.DataFrame

Create bagging dataframe for attention analysis.

Creates a single bag per patient by setting Patient_Bagging_ID to the Patient_ID. This is used for full attention analysis without bagging.

Parameters:

df – Input DataFrame with columns: Patient_ID, Patient_Label, and embedding features.
random_seed – Random seed for reproducibility (unused but kept for consistency). Defaults to 27.

Returns:

DataFrame with Patient_Bagging_ID set to Patient_ID for each row.

Return type:

pandas.DataFrame
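In pandas terms, "one bag per patient" amounts to copying the Patient_ID column into Patient_Bagging_ID. The helper name below is hypothetical, and whether the real function copies the input or modifies it in place is not stated here; this sketch returns a copy.

```python
import pandas as pd


def single_bag_per_patient_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical equivalent: every patient becomes exactly one bag."""
    # assign returns a new DataFrame, so the caller's input is left unchanged.
    return df.assign(Patient_Bagging_ID=df["Patient_ID"])


toy = pd.DataFrame({"Patient_ID": ["P1", "P1", "P2"], "Patient_Label": [1, 1, 0]})
bagged = single_bag_per_patient_sketch(toy)
```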

loki2.mil.models.src.dataset.custom_collate_fn(batch: List[Tuple[torch.Tensor, torch.Tensor, str, str, numpy.ndarray]]) → Tuple[torch.Tensor, torch.Tensor, List[str], List[str], List[numpy.ndarray]]

Custom collate function for attention analysis dataloader.

Stacks bag tensors and label tensors, and collects metadata lists (patient IDs, bagging IDs, cell positions) without stacking.

Parameters:

batch – List of tuples from PatientTileBagDatasetForAttention_V7, each containing (bag_tensor, label_tensor, patient_id, bagging_id, cell_positions).

Returns:

A tuple containing:

bag_tensor_cat: Stacked bag tensors of shape (batch_size, bag_size, embedding_dim).
label_tensor_cat: Stacked label tensors of shape (batch_size, 1).
patient_id_list: List of patient ID strings.
bagging_id_list: List of bagging ID strings.
cell_positions_list: List of NumPy arrays of cell positions.

Return type:

Tuple[torch.Tensor, torch.Tensor, List[str], List[str], List[numpy.ndarray]]
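The stack-tensors, list-the-rest pattern described above might look like the following. It is a sketch only: collate_sketch is a hypothetical name, and it assumes each label tensor has shape (1,) so that stacking yields (batch_size, 1).

```python
from typing import List, Tuple

import numpy as np
import torch

Sample = Tuple[torch.Tensor, torch.Tensor, str, str, np.ndarray]


def collate_sketch(batch: List[Sample]):
    """Hypothetical collate: stack tensors, keep metadata as Python lists."""
    bags, labels, patient_ids, bagging_ids, positions = zip(*batch)
    return (
        torch.stack(bags),    # (batch_size, bag_size, embedding_dim)
        torch.stack(labels),  # (batch_size, 1) when each label has shape (1,)
        list(patient_ids),
        list(bagging_ids),
        list(positions),
    )


# Two toy samples with bag_size=3 and embedding_dim=4.
batch = [
    (torch.zeros(3, 4), torch.tensor([1.0]), "P1", "P1", np.arange(3)),
    (torch.ones(3, 4), torch.tensor([0.0]), "P2", "P2", np.arange(3)),
]
bag_batch, label_batch, pids, bids, pos = collate_sketch(batch)
```

A function like this is passed to torch.utils.data.DataLoader through its collate_fn argument, so the per-bag metadata reaches the caller as plain lists alongside the batched tensors.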

loki2.mil.models.src.dataset.prepare_data_for_attention_dataset(dataframe: pandas.DataFrame) → pandas.DataFrame

Prepare data for attention dataset by adding necessary columns.

Ensures the DataFrame has all columns required for attention analysis, adding Patient_Bagging_ID if it is missing.

Parameters:

dataframe – Input DataFrame with patient data.

Returns:

DataFrame with all required columns, including Patient_Bagging_ID if it was missing.

Return type:

pandas.DataFrame

Raises:

ValueError – If required columns (Patient_ID, Patient_Bagging_ID, Patient_Label, cell_position) are missing.
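A fill-then-validate step of this kind could be sketched as below. The helper name prepare_sketch and the choice to default a missing Patient_Bagging_ID to Patient_ID are assumptions; the required column names come from the Raises entry above.

```python
import pandas as pd

REQUIRED = ["Patient_ID", "Patient_Bagging_ID", "Patient_Label", "cell_position"]


def prepare_sketch(dataframe: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical preparation step: fill the bagging ID, then validate."""
    out = dataframe.copy()
    # Assumption: a missing bagging ID defaults to one bag per patient.
    if "Patient_Bagging_ID" not in out.columns and "Patient_ID" in out.columns:
        out["Patient_Bagging_ID"] = out["Patient_ID"]
    missing = [c for c in REQUIRED if c not in out.columns]
    if missing:
        raise ValueError(f"Missing required columns: {missing}")
    return out


toy = pd.DataFrame({
    "Patient_ID": ["P1", "P2"],
    "Patient_Label": [1, 0],
    "cell_position": [0, 0],
})
prepared = prepare_sketch(toy)
```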

loki2.mil.models.src.dataset.create_balanced_fold_assignments(patient_ids: numpy.ndarray, patient_labels: numpy.ndarray, n_folds: int, seed: int = 27) → List[Dict[str, List[str]]]

Create balanced fold assignments.

Ensures each fold has both positive and negative samples by manually distributing patients across folds.

Parameters:

patient_ids – Array of patient IDs.
patient_labels – Array of patient labels (0/1).
n_folds – Number of folds for cross-validation.
seed – Random seed for reproducibility. Defaults to 27.

Returns:

List of dictionaries, one per fold.

Each dictionary contains:

'positive': List of positive patient IDs.
'negative': List of negative patient IDs.

Return type:

List[Dict[str, List[str]]]
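One way to distribute patients so that every fold sees both classes is to shuffle each class separately and deal its patients out round-robin. The sketch below illustrates that idea; balanced_folds_sketch is a hypothetical name and the real function's distribution rule may differ.

```python
import numpy as np


def balanced_folds_sketch(patient_ids, patient_labels, n_folds, seed=27):
    """Hypothetical round-robin assignment, done separately per class."""
    rng = np.random.default_rng(seed)
    ids = np.asarray(patient_ids)
    labels = np.asarray(patient_labels)
    folds = [{"positive": [], "negative": []} for _ in range(n_folds)]
    for name, value in (("positive", 1), ("negative", 0)):
        shuffled = rng.permutation(ids[labels == value])
        # Deal patients to folds like cards: 0, 1, ..., n_folds - 1, 0, ...
        for i, pid in enumerate(shuffled):
            folds[i % n_folds][name].append(str(pid))
    return folds


# 4 positive and 6 negative patients spread over 3 folds.
folds = balanced_folds_sketch(
    patient_ids=[f"P{i}" for i in range(10)],
    patient_labels=[1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    n_folds=3,
)
```

With this scheme a fold is guaranteed both classes only when each class has at least n_folds patients.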

loki2.mil.models.src.dataset.create_balanced_train_val_split(train_patient_ids: numpy.ndarray, train_patient_labels: numpy.ndarray, val_size: float, seed: int = 27) → Tuple[numpy.ndarray, numpy.ndarray]

Create balanced train/val split.

Ensures both training and validation sets have positive and negative samples by manually selecting at least one of each class for validation.

Parameters:

train_patient_ids – Array of training patient IDs.
train_patient_labels – Array of training patient labels (0/1).
val_size – Validation set ratio (target fraction of patients).
seed – Random seed for reproducibility. Defaults to 27.

Returns:

A tuple containing:

train_patients_inner: Array of inner training set patient IDs.
val_patients_inner: Array of inner validation set patient IDs.

Return type:

Tuple[numpy.ndarray, numpy.ndarray]
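A split that guarantees both classes on each side can be sketched by first placing one patient of each class in validation, reserving one of each for training, and only then filling validation up to the target size. The name balanced_split_sketch and the reservation step are assumptions, and the sketch requires at least two patients per class.

```python
import numpy as np


def balanced_split_sketch(ids, labels, val_size, seed=27):
    """Hypothetical split that seeds validation with one patient per class."""
    rng = np.random.default_rng(seed)
    ids, labels = np.asarray(ids), np.asarray(labels)
    order = rng.permutation(len(ids)).tolist()
    # Target validation count, but never fewer than one patient per class.
    n_val = max(2, int(round(val_size * len(ids))))
    pos = [i for i in order if labels[i] == 1]
    neg = [i for i in order if labels[i] == 0]
    val = [pos[0], neg[0]]       # guaranteed validation members
    reserved = {pos[1], neg[1]}  # guaranteed training members
    free = [i for i in order if i not in val and i not in reserved]
    # Fill the remaining validation slots from the unreserved patients.
    val += free[: max(0, n_val - 2)]
    val_set = set(val)
    train = [i for i in order if i not in val_set]
    return ids[train], ids[val]


patient_ids = [f"P{i}" for i in range(10)]
patient_labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
train_ids, val_ids = balanced_split_sketch(patient_ids, patient_labels, val_size=0.3)
```

Because of the guaranteed members, the realised validation fraction can deviate from val_size on very small cohorts.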