loki2.cl.prepare_training

Convert morphology/transcription embeddings into WebDataset shards for contrastive training.

Module Contents

loki2.cl.prepare_training.DTYPE_MAP

loki2.cl.prepare_training.parse_args(argv: Iterable[str] | None = None) → argparse.Namespace

Parse command-line arguments for building WebDataset shards.

Parameters: argv – Optional list of arguments to parse. If None, uses sys.argv.
Returns: Parsed arguments.
Return type: argparse.Namespace

loki2.cl.prepare_training.cell_id_to_index(cell_id: str) → int

Extract row index from a cell ID string.

Parameters: cell_id – Cell ID string ending with a numeric index.
Returns: Zero-based row index extracted from the cell ID.
Return type: int
Raises: ValueError – If no numeric suffix is found in the cell ID.
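
The behavior described above can be illustrated with a minimal sketch. The name trailing_index and the regular expression are assumptions made for illustration, not the module's actual implementation. In particular, the sketch returns the numeric suffix unchanged; whether the real function applies an offset to make the index zero-based is not stated here.

```python
import re

# Hypothetical pattern: one or more digits at the very end of the string.
_TRAILING_DIGITS = re.compile(r"(\d+)$")


def trailing_index(cell_id: str) -> int:
    """Return the numeric suffix of a cell ID as an int (illustrative only)."""
    match = _TRAILING_DIGITS.search(cell_id)
    if match is None:
        # Mirrors the documented contract: no numeric suffix raises ValueError.
        raise ValueError(f"no numeric suffix in cell ID: {cell_id!r}")
    return int(match.group(1))
```

With this sketch, a hypothetical ID such as "cell_42" maps to 42, leading zeros are dropped ("sample-A_007" maps to 7), and an ID without trailing digits raises ValueError.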

loki2.cl.prepare_training.tensor_to_bytes(tensor: torch.Tensor, dtype: torch.dtype) → bytes

Convert a PyTorch tensor to NumPy array bytes.

Parameters:

tensor – PyTorch tensor to convert.
dtype – Target dtype for the conversion.

Returns:

Serialized NumPy array as bytes.

Return type:

bytes
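
One plausible way to produce "serialized NumPy array bytes" is the .npy format written to an in-memory buffer. The sketch below assumes that format, which this page does not confirm, and works on array-like input with a NumPy dtype so that it runs without PyTorch. A tensor would first have to be cast and converted to a NumPy array. The helper names are hypothetical.

```python
import io

import numpy as np


def array_to_npy_bytes(array, dtype) -> bytes:
    """Serialize array-like data to .npy bytes (assumed format, illustrative)."""
    buffer = io.BytesIO()
    # Cast to the target dtype, then write the .npy header and raw data.
    np.save(buffer, np.asarray(array, dtype=dtype), allow_pickle=False)
    return buffer.getvalue()


def npy_bytes_to_array(data: bytes) -> np.ndarray:
    """Inverse of array_to_npy_bytes, useful for checking a payload."""
    return np.load(io.BytesIO(data), allow_pickle=False)
```

Because the .npy header records shape and dtype, a reader can recover the array without any side information, which is convenient when each payload is stored as a separate file inside a shard.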

loki2.cl.prepare_training.write_shard(dest: pathlib.Path, samples: List[Tuple[str, List[Tuple[str, bytes]]]]) → None

Write a WebDataset shard (tar file) containing samples.

Parameters:

dest – Destination path for the tar shard file.
samples – List of (key, payloads) tuples where payloads is a list of (suffix, data) tuples.
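
A WebDataset shard is an ordinary tar archive in which files sharing a basename form one sample. The following sketch shows how a samples list of the shape described above could be written with the standard library. The function name write_tar_shard, the "key.suffix" member naming, and the assumption that suffixes carry no leading dot are illustrative choices, not taken from the module.

```python
import io
import tarfile
from pathlib import Path
from typing import List, Tuple

# (key, [(suffix, data), ...]) as described in the parameter list above.
Sample = Tuple[str, List[Tuple[str, bytes]]]


def write_tar_shard(dest: Path, samples: List[Sample]) -> None:
    """Write each payload as a tar member named '<key>.<suffix>' (illustrative)."""
    with tarfile.open(dest, "w") as tar:
        for key, payloads in samples:
            for suffix, data in payloads:
                info = tarfile.TarInfo(name=f"{key}.{suffix}")
                info.size = len(data)  # tar needs the size before the data
                tar.addfile(info, io.BytesIO(data))
```

Members of one sample are written consecutively, so a streaming reader can group them by key. Keys should not contain dots, since the part of the member name before the first dot is what identifies the sample.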

loki2.cl.prepare_training.build_samples(dataset_name: str, morphological: torch.Tensor, morph_positions: torch.Tensor, transcription: torch.Tensor, ordered_indices: Sequence[int], ordered_cell_ids: Sequence[str], split: str, tensor_dtype: torch.dtype) → Iterator[Tuple[str, List[Tuple[str, bytes]], Tuple[str, str, int, float, float]]]

Build WebDataset samples from paired morphological and transcription embeddings.

Parameters:

dataset_name – Name identifier for the dataset.
morphological – Morphology embeddings tensor.
morph_positions – Position coordinates for morphology embeddings.
transcription – Transcription embeddings tensor.
ordered_indices – Ordered row indices matching the transcription embeddings.
ordered_cell_ids – Ordered cell IDs matching the transcription embeddings.
split – Dataset split identifier ('train' or 'val').
tensor_dtype – Data type for serializing tensors.

Yields:

Tuple containing:

key – Sample key string.
payloads – List of (suffix, data) tuples for the shard.
meta_row – Metadata tuple (key, cell_id, row_index, x, y, split).

loki2.cl.prepare_training.main(argv: Iterable[str] | None = None) → None

Main entry point for preparing WebDataset shards from embeddings.

Parameters: argv – Optional command-line arguments. If None, uses sys.argv.