loki2.cl.prepare_training
=========================

.. py:module:: loki2.cl.prepare_training

.. autoapi-nested-parse::

   Convert morphology/transcription embeddings into WebDataset shards for contrastive training.


Module Contents
---------------

.. py:data:: DTYPE_MAP

.. py:function:: parse_args(argv: Iterable[str] | None = None) -> argparse.Namespace

   Parse command-line arguments for building WebDataset shards.

   :param argv: Optional list of arguments to parse. If None, uses sys.argv.

   :returns: Parsed arguments.
   :rtype: argparse.Namespace


.. py:function:: cell_id_to_index(cell_id: str) -> int

   Extract row index from a cell ID string.

   :param cell_id: Cell ID string ending with a numeric index.

   :returns: Zero-based row index extracted from the cell ID.
   :rtype: int

   :raises ValueError: If no numeric suffix is found in the cell ID.
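
   The suffix extraction described above can be sketched as follows. This is a
   minimal illustration, not the module's actual implementation: the name
   ``cell_id_to_index_sketch`` is hypothetical, and it assumes the trailing
   digits are used directly as the row index with no offset applied.

   .. code-block:: python

      import re

      # Matches one or more digits at the very end of the string.
      _NUMERIC_SUFFIX = re.compile(r"(\d+)$")


      def cell_id_to_index_sketch(cell_id: str) -> int:
          """Return the trailing integer of ``cell_id`` (assumed to be the row index)."""
          match = _NUMERIC_SUFFIX.search(cell_id)
          if match is None:
              # Mirrors the documented behaviour: no numeric suffix is an error.
              raise ValueError(f"No numeric suffix found in cell ID: {cell_id!r}")
          return int(match.group(1))

   Under these assumptions, a hypothetical ID such as ``"cell_42"`` maps to
   row ``42``, while an ID with no trailing digits raises ``ValueError``.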


.. py:function:: tensor_to_bytes(tensor: torch.Tensor, dtype: torch.dtype) -> bytes

   Convert a PyTorch tensor to NumPy array bytes.

   :param tensor: PyTorch tensor to convert.
   :param dtype: Target dtype for the conversion.

   :returns: Serialized NumPy array as bytes.
   :rtype: bytes


.. py:function:: write_shard(dest: pathlib.Path, samples: List[Tuple[str, List[Tuple[str, bytes]]]]) -> None

   Write a WebDataset shard (tar file) containing samples.

   :param dest: Destination path for the tar shard file.
   :param samples: List of (key, payloads) tuples where payloads is a list
                   of (suffix, data) tuples.
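
   A WebDataset shard is a plain tar archive in which files that share a
   basename form one sample. The sketch below shows one way the ``samples``
   structure could be written using only the standard library. It is an
   illustration under assumptions, not the module's actual code: the name
   ``write_shard_sketch`` is hypothetical, as is the choice to name each
   member ``<key>.<suffix>``.

   .. code-block:: python

      import io
      import tarfile
      from pathlib import Path
      from typing import List, Tuple


      def write_shard_sketch(
          dest: Path,
          samples: List[Tuple[str, List[Tuple[str, bytes]]]],
      ) -> None:
          """Write each (suffix, data) payload as the tar member ``<key>.<suffix>``."""
          with tarfile.open(dest, "w") as tar:
              for key, payloads in samples:
                  for suffix, data in payloads:
                      info = tarfile.TarInfo(name=f"{key}.{suffix}")
                      # The size must be set explicitly for in-memory data.
                      info.size = len(data)
                      tar.addfile(info, io.BytesIO(data))

   Keeping all payloads of one key adjacent in the archive matters, because
   WebDataset readers group consecutive members by basename when they
   reassemble samples.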


.. py:function:: build_samples(dataset_name: str, morphological: torch.Tensor, morph_positions: torch.Tensor, transcription: torch.Tensor, ordered_indices: Sequence[int], ordered_cell_ids: Sequence[str], split: str, tensor_dtype: torch.dtype) -> Iterator[Tuple[str, List[Tuple[str, bytes]], Tuple[str, str, int, float, float]]]

   Build WebDataset samples from paired morphological and transcription embeddings.

   :param dataset_name: Name identifier for the dataset.
   :param morphological: Morphology embeddings tensor.
   :param morph_positions: Position coordinates for morphology embeddings.
   :param transcription: Transcription embeddings tensor.
   :param ordered_indices: Ordered row indices matching the transcription embeddings.
   :param ordered_cell_ids: Ordered cell IDs matching the transcription embeddings.
   :param split: Dataset split identifier ('train' or 'val').
   :param tensor_dtype: Data type for serializing tensors.

   :Yields: A tuple containing:

      - ``key``: Sample key string
      - ``payloads``: List of (suffix, data) tuples for the shard
      - ``meta_row``: Metadata tuple (key, cell_id, row_index, x, y, split)


.. py:function:: main(argv: Iterable[str] | None = None) -> None

   Main entry point for preparing WebDataset shards from embeddings.

   :param argv: Optional command-line arguments. If None, uses sys.argv.