loki2.mil.models.src.dataset
============================

.. py:module:: loki2.mil.models.src.dataset

.. autoapi-nested-parse::

   Dataset classes for Multiple Instance Learning (MIL) cell analysis.

   This module provides PyTorch Dataset classes for processing patient tile
   embeddings in a bag-of-instances format for MIL training and evaluation.


Module Contents
---------------

.. py:class:: PatientTileBagDataset_v2(dataframe: pandas.DataFrame, bag_size: int = 300, random_seed: int = 27)

   Bases: :py:obj:`torch.utils.data.Dataset`


   Dataset for patient tile bagging with fixed bag size.

   This dataset groups patient data by Patient_Bagging_ID and samples
   a fixed number of tiles (bag_size) from each group for training.

   .. py:attribute:: data

      Input DataFrame containing patient embeddings and labels.


   .. py:attribute:: bag_size

      Number of tiles to sample per bag. Defaults to 300.


   .. py:attribute:: rng

      Random number generator for reproducible sampling.


   .. py:attribute:: patient_groups

      DataFrame grouped by Patient_Bagging_ID.


   .. py:attribute:: patient_ids

      List of unique Patient_Bagging_ID values.
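
   The group-then-sample behaviour described for this class can be
   illustrated with plain pandas and NumPy. This is a minimal sketch, not
   the class's actual code: the toy columns ``Patient_Label`` and ``emb_0``,
   the helper ``sample_bag``, the use of ``numpy.random.default_rng``, and
   the choice to sample with replacement when a group is smaller than
   ``bag_size`` are all assumptions made for illustration.

   .. code-block:: python

      import numpy as np
      import pandas as pd

      # Toy frame: two bags, one larger and one smaller than bag_size.
      df = pd.DataFrame({
          "Patient_Bagging_ID": ["P1_0"] * 5 + ["P2_0"] * 3,
          "Patient_Label": [1] * 5 + [0] * 3,
          "emb_0": np.arange(8, dtype=float),
      })

      bag_size = 4
      rng = np.random.default_rng(27)  # assumed RNG type
      groups = df.groupby("Patient_Bagging_ID")
      bag_ids = list(groups.groups.keys())

      def sample_bag(bag_id):
          """Return exactly bag_size rows for one bag (hypothetical helper)."""
          group = groups.get_group(bag_id)
          # Assumption: fall back to sampling with replacement for small groups.
          replace = len(group) < bag_size
          idx = rng.choice(len(group), size=bag_size, replace=replace)
          return group.iloc[idx]

      small_bag = sample_bag("P2_0")  # 3 tiles available, so rows repeat
      large_bag = sample_bag("P1_0")  # 5 tiles available, so rows are unique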


.. py:class:: PatientTileBagDatasetForAttention_V7(dataframe: pandas.DataFrame, bag_size: int = 2000, random_seed: int = 42)

   Bases: :py:obj:`torch.utils.data.Dataset`


   Dataset for attention analysis with patient and bagging ID tracking.

   This dataset groups data by both Patient_ID and Patient_Bagging_ID,
   allowing for attention analysis across multiple bags per patient.

   .. py:attribute:: data

      Input DataFrame containing patient embeddings and labels.


   .. py:attribute:: bag_size

      Number of tiles to sample per bag. Defaults to 2000.


   .. py:attribute:: rng

      Random number generator for reproducible sampling.


   .. py:attribute:: group_cols
      :value: ['Patient_ID', 'Patient_Bagging_ID']

      List of column names to group by.


   .. py:attribute:: patient_groups

      DataFrame grouped by Patient_ID and Patient_Bagging_ID.


   .. py:attribute:: keys

      List of unique (Patient_ID, Patient_Bagging_ID) tuples.

.. py:function:: dynamic_multiplication_factor(total_patches: int) -> int

   Calculate dynamic multiplication factor for bagging.

   The multiplication factor increases as the number of patches decreases,
   ensuring adequate bagging for patients with fewer patches.

   :param total_patches: Total number of patches for a patient.

   :returns: Multiplication factor between 2 and 5, rounded to the nearest integer.
   :rtype: int
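
   The exact formula is not given here. As a rough illustration of a factor
   that shrinks from 5 to 2 as the patch count grows, the sketch below uses
   linear interpolation between two thresholds. The function name
   ``example_multiplication_factor`` and the thresholds ``low`` and ``high``
   are hypothetical and are not taken from the module.

   .. code-block:: python

      def example_multiplication_factor(
          total_patches: int, low: int = 500, high: int = 5000
      ) -> int:
          """Illustrative factor in [2, 5]; thresholds are made up."""
          if total_patches <= low:
              return 5  # few patches: create the most bags
          if total_patches >= high:
              return 2  # many patches: create the fewest bags
          # Linearly interpolate from 5 down to 2, then round.
          frac = (total_patches - low) / (high - low)
          return round(5 - 3 * frac)

      few = example_multiplication_factor(100)
      many = example_multiplication_factor(10_000)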


.. py:function:: create_bagging_dataframe_noreplacement_V12(df: pandas.DataFrame, BAG_SIZE: int, MIN_PATCHES_PER_WSI: int, MAX_PATCHES_PER_WSI: int, random_seed: int = 27) -> pandas.DataFrame

   Create bagging dataframe without replacement.

   Groups patients and creates bags of tiles, ensuring no tile is reused
   within the same patient until all tiles have been used. The number of
   bags per patient is determined by a dynamic multiplication factor.

   :param df: Input DataFrame with columns: Patient_ID, cell_position, and
              embedding features.
   :param BAG_SIZE: Number of tiles per bag.
   :param MIN_PATCHES_PER_WSI: Minimum number of patches to use per WSI.
   :param MAX_PATCHES_PER_WSI: Maximum number of patches to use per WSI.
   :param random_seed: Random seed for reproducibility. Defaults to 27.

   :returns: DataFrame with an added Patient_Bagging_ID column, containing
             the bagged tile data.
   :rtype: pandas.DataFrame
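
   The core idea of bagging without replacement can be sketched as: shuffle
   each patient's tiles once, then cut the shuffled rows into consecutive
   chunks of ``BAG_SIZE``. The sketch below is a simplification under stated
   assumptions. It makes a single pass over the tiles, drops an incomplete
   final chunk, ignores the min/max patch limits and the multiplication
   factor, and uses a hypothetical ``<Patient_ID>_<index>`` naming scheme
   for ``Patient_Bagging_ID``.

   .. code-block:: python

      import numpy as np
      import pandas as pd

      def bag_without_replacement(df, bag_size, random_seed=27):
          """Single-pass bagging sketch (hypothetical helper)."""
          rng = np.random.default_rng(random_seed)
          parts = []
          for patient_id, group in df.groupby("Patient_ID"):
              # Shuffle once so that chunks never share a tile.
              shuffled = group.iloc[rng.permutation(len(group))]
              n_bags = len(shuffled) // bag_size  # incomplete remainder dropped
              for b in range(n_bags):
                  bag = shuffled.iloc[b * bag_size:(b + 1) * bag_size].copy()
                  bag["Patient_Bagging_ID"] = f"{patient_id}_{b}"  # assumed format
                  parts.append(bag)
          return pd.concat(parts, ignore_index=True)

      # Toy input: integer cell_position values stand in for real positions.
      toy = pd.DataFrame({
          "Patient_ID": ["P1"] * 7 + ["P2"] * 4,
          "cell_position": range(11),
          "emb_0": np.linspace(0.0, 1.0, 11),
      })
      bagged = bag_without_replacement(toy, bag_size=3)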


.. py:function:: create_bagging_dataframe_All_v7(df: pandas.DataFrame, random_seed: int = 27) -> pandas.DataFrame

   Create bagging dataframe for attention analysis.

   Creates a single bag per patient by setting Patient_Bagging_ID to
   the Patient_ID. This is used for full attention analysis without
   bagging.

   :param df: Input DataFrame with columns: Patient_ID, Patient_Label, and
              embedding features.
   :param random_seed: Random seed for reproducibility (unused but kept for
                       consistency). Defaults to 27.

   :returns: DataFrame with Patient_Bagging_ID set to Patient_ID for each row.
   :rtype: pandas.DataFrame
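
   The described behaviour amounts to copying one column. A minimal pandas
   sketch on toy data follows; the ``emb_0`` column is an assumption, and
   copying the input first is a choice made here, not necessarily what the
   function does.

   .. code-block:: python

      import pandas as pd

      df = pd.DataFrame({
          "Patient_ID": ["P1", "P1", "P2"],
          "Patient_Label": [1, 1, 0],
          "emb_0": [0.1, 0.2, 0.3],
      })

      # One bag per patient: the bagging ID is simply the patient ID.
      bagged = df.copy()
      bagged["Patient_Bagging_ID"] = bagged["Patient_ID"]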


.. py:function:: custom_collate_fn(batch: List[Tuple[torch.Tensor, torch.Tensor, str, str, numpy.ndarray]]) -> Tuple[torch.Tensor, torch.Tensor, List[str], List[str], List[numpy.ndarray]]

   Custom collate function for attention analysis dataloader.

   Stacks bag tensors and label tensors, and collects metadata lists
   (patient IDs, bagging IDs, cell positions) without stacking.

   :param batch: List of tuples from PatientTileBagDatasetForAttention_V7,
                 each containing (bag_tensor, label_tensor, patient_id,
                 bagging_id, cell_positions).

   :returns: A tuple containing:

             - bag_tensor_cat: Stacked bag tensors of shape
               (batch_size, bag_size, embedding_dim).
             - label_tensor_cat: Stacked label tensors of shape
               (batch_size, 1).
             - patient_id_list: List of patient ID strings.
             - bagging_id_list: List of bagging ID strings.
             - cell_positions_list: List of NumPy arrays of cell positions.
   :rtype: Tuple[torch.Tensor, torch.Tensor, List[str], List[str], List[numpy.ndarray]]
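
   A collate function with this contract can be sketched in a few lines:
   unzip the batch, stack the two tensor fields, and leave the rest as
   lists. The name ``collate_sketch`` and the toy shapes (4 tiles,
   8 embedding dimensions, 2-column positions) are assumptions for
   illustration only.

   .. code-block:: python

      from typing import List, Tuple

      import numpy as np
      import torch

      Item = Tuple[torch.Tensor, torch.Tensor, str, str, np.ndarray]

      def collate_sketch(batch: List[Item]):
          """Stack tensors; keep IDs and positions as plain lists."""
          bags, labels, patient_ids, bagging_ids, positions = zip(*batch)
          return (
              torch.stack(bags),    # (batch_size, bag_size, embedding_dim)
              torch.stack(labels),  # (batch_size, 1)
              list(patient_ids),
              list(bagging_ids),
              list(positions),      # left unstacked on purpose
          )

      batch = [
          (torch.zeros(4, 8), torch.tensor([1.0]), "P1", "P1_0", np.zeros((4, 2))),
          (torch.ones(4, 8), torch.tensor([0.0]), "P2", "P2_0", np.ones((4, 2))),
      ]
      bag_t, label_t, pids, bids, pos = collate_sketch(batch)

   Such a function is typically passed to ``torch.utils.data.DataLoader``
   through its ``collate_fn`` argument, so that the metadata stays as Python
   lists instead of going through the default collation.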


.. py:function:: prepare_data_for_attention_dataset(dataframe: pandas.DataFrame) -> pandas.DataFrame

   Prepare data for attention dataset by adding necessary columns.

   Ensures the DataFrame has all columns required for attention analysis,
   adding Patient_Bagging_ID if it is missing.

   :param dataframe: Input DataFrame with patient data.

   :returns: DataFrame with all required columns, including
             Patient_Bagging_ID if it was missing.
   :rtype: pandas.DataFrame

   :raises ValueError: If required columns (Patient_ID, Patient_Bagging_ID,
       Patient_Label, cell_position) are missing.
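
   One plausible reading of this contract is "fill in the bagging ID, then
   validate". The sketch below follows that reading. The fallback of copying
   ``Patient_ID`` into ``Patient_Bagging_ID`` is an assumption, and
   ``prepare_sketch`` and ``REQUIRED`` are hypothetical names.

   .. code-block:: python

      import pandas as pd

      REQUIRED = ["Patient_ID", "Patient_Bagging_ID", "Patient_Label", "cell_position"]

      def prepare_sketch(dataframe: pd.DataFrame) -> pd.DataFrame:
          """Add a missing bagging ID, then check required columns."""
          out = dataframe.copy()
          if "Patient_Bagging_ID" not in out.columns and "Patient_ID" in out.columns:
              # Assumption: default to one bag per patient.
              out["Patient_Bagging_ID"] = out["Patient_ID"]
          missing = [col for col in REQUIRED if col not in out.columns]
          if missing:
              raise ValueError(f"Missing required columns: {missing}")
          return out

      prepared = prepare_sketch(pd.DataFrame({
          "Patient_ID": ["P1"],
          "Patient_Label": [1],
          "cell_position": ["0_0"],  # placeholder value
      }))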


.. py:function:: create_balanced_fold_assignments(patient_ids: numpy.ndarray, patient_labels: numpy.ndarray, n_folds: int, seed: int = 27) -> List[Dict[str, List[str]]]

   Create balanced fold assignments.

   Ensures each fold has both positive and negative samples by manually
   distributing patients across folds.

   :param patient_ids: Array of patient IDs.
   :param patient_labels: Array of patient labels (0/1).
   :param n_folds: Number of folds for cross-validation.
   :param seed: Random seed for reproducibility. Defaults to 27.

   :returns: List of dictionaries, one per fold. Each dictionary contains:

             - 'positive': List of positive patient IDs.
             - 'negative': List of negative patient IDs.
   :rtype: List[Dict[str, List[str]]]
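
   One way to guarantee both classes in every fold is to shuffle each class
   separately and deal its patients out round-robin. The sketch below
   illustrates that idea. It is not necessarily the strategy this function
   uses, and ``balanced_folds_sketch`` is a hypothetical name.

   .. code-block:: python

      import numpy as np

      def balanced_folds_sketch(patient_ids, patient_labels, n_folds, seed=27):
          """Deal positives and negatives round-robin across folds."""
          rng = np.random.default_rng(seed)
          ids = np.asarray(patient_ids)
          labels = np.asarray(patient_labels)
          folds = [{"positive": [], "negative": []} for _ in range(n_folds)]
          for name, value in (("positive", 1), ("negative", 0)):
              members = rng.permutation(ids[labels == value])
              for i, pid in enumerate(members):
                  folds[i % n_folds][name].append(str(pid))
          return folds

      ids = [f"P{i}" for i in range(12)]
      labels = [1] * 6 + [0] * 6
      folds = balanced_folds_sketch(ids, labels, n_folds=3)

   With at least ``n_folds`` patients in each class, every fold receives at
   least one positive and one negative patient.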


.. py:function:: create_balanced_train_val_split(train_patient_ids: numpy.ndarray, train_patient_labels: numpy.ndarray, val_size: float, seed: int = 27) -> Tuple[numpy.ndarray, numpy.ndarray]

   Create balanced train/val split.

   Ensures both training and validation sets have positive and negative
   samples by manually selecting at least one of each class for validation.

   :param train_patient_ids: Array of training patient IDs.
   :param train_patient_labels: Array of training patient labels (0/1).
   :param val_size: Validation set ratio (target fraction of patients).
   :param seed: Random seed for reproducibility. Defaults to 27.

   :returns: A tuple containing:

             - train_patients_inner: Array of inner training set patient IDs.
             - val_patients_inner: Array of inner validation set patient IDs.
   :rtype: Tuple[numpy.ndarray, numpy.ndarray]
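
   The "reserve one of each class, then fill up" idea can be sketched as
   follows. This is an illustration under assumptions, not the function's
   actual logic: ``balanced_split_sketch`` is a hypothetical name, it
   requires at least two patients per class, and it also reserves one
   patient of each class for training so that neither side loses a class.

   .. code-block:: python

      import numpy as np

      def balanced_split_sketch(ids, labels, val_size, seed=27):
          """Split so that train and val each contain both classes."""
          rng = np.random.default_rng(seed)
          ids = np.asarray(ids)
          labels = np.asarray(labels)
          pos = rng.permutation(ids[labels == 1])
          neg = rng.permutation(ids[labels == 0])
          if len(pos) < 2 or len(neg) < 2:
              raise ValueError("Need at least two patients per class.")
          val = [pos[0], neg[0]]    # one of each class for validation
          train = [pos[1], neg[1]]  # one of each class for training
          pool = rng.permutation(np.concatenate([pos[2:], neg[2:]]))
          # Top up validation towards the target fraction of patients.
          n_val = max(2, int(round(val_size * len(ids))))
          extra = min(max(n_val - 2, 0), len(pool))
          val.extend(pool[:extra])
          train.extend(pool[extra:])
          return np.asarray(train), np.asarray(val)

      ids = [f"P{i}" for i in range(10)]
      labels = [1] * 5 + [0] * 5
      train_inner, val_inner = balanced_split_sketch(ids, labels, val_size=0.3)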