loki2.cl.prepare_training
Convert morphology/transcription embeddings into WebDataset shards for contrastive training.
Module Contents
- loki2.cl.prepare_training.DTYPE_MAP
- loki2.cl.prepare_training.parse_args(argv: Iterable[str] | None = None) → argparse.Namespace
Parse command-line arguments for building WebDataset shards.
- Parameters:
argv – Optional iterable of argument strings to parse. If None, arguments are read from sys.argv.
- Returns:
Parsed arguments.
- Return type:
argparse.Namespace
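The `argv=None` convention described above can be sketched as follows. This is a minimal illustration, not the module's actual parser: the function name `parse_args_sketch` and the flags `--output-dir` and `--dtype` are hypothetical stand-ins, since the real options are not listed here.

```python
import argparse
from typing import Iterable, Optional


def parse_args_sketch(argv: Optional[Iterable[str]] = None) -> argparse.Namespace:
    """Parse arguments from argv, or from the process command line when argv is None."""
    parser = argparse.ArgumentParser(description="Build WebDataset shards (sketch).")
    # Hypothetical options, for illustration only.
    parser.add_argument("--output-dir", default="shards")
    parser.add_argument("--dtype", default="float32")
    # When given None, argparse reads sys.argv[1:] itself.
    return parser.parse_args(None if argv is None else list(argv))


args = parse_args_sketch(["--output-dir", "out", "--dtype", "float16"])
print(args.output_dir, args.dtype)
```

Accepting an explicit `argv` keeps the parser testable, because callers can pass a list of strings without touching the real command line.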
- loki2.cl.prepare_training.cell_id_to_index(cell_id: str) → int
Extract row index from a cell ID string.
- Parameters:
cell_id – Cell ID string ending with a numeric index.
- Returns:
Zero-based row index extracted from the cell ID.
- Return type:
int
- Raises:
ValueError – If no numeric suffix is found in the cell ID.
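One way to implement the behavior documented above is a regular expression anchored at the end of the string. The sketch below assumes the numeric suffix is already the zero-based row index (no offset is applied); the name `cell_id_to_index_sketch` and the example IDs are hypothetical.

```python
import re

# Matches one or more digits at the very end of the string.
_NUMERIC_SUFFIX = re.compile(r"(\d+)$")


def cell_id_to_index_sketch(cell_id: str) -> int:
    """Return the trailing integer of a cell ID, or raise ValueError if there is none."""
    match = _NUMERIC_SUFFIX.search(cell_id)
    if match is None:
        raise ValueError(f"No numeric suffix found in cell ID: {cell_id!r}")
    # Assumption: the suffix is used directly as the zero-based row index.
    return int(match.group(1))


print(cell_id_to_index_sketch("cell_42"))
print(cell_id_to_index_sketch("sampleA-cell_007"))
```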
- loki2.cl.prepare_training.tensor_to_bytes(tensor: torch.Tensor, dtype: torch.dtype) → bytes
Convert a PyTorch tensor to the target dtype and serialize it as NumPy array bytes.
- Parameters:
tensor – PyTorch tensor to convert.
dtype – Target dtype for the conversion.
- Returns:
Serialized NumPy array as bytes.
- Return type:
bytes
- loki2.cl.prepare_training.write_shard(dest: pathlib.Path, samples: List[Tuple[str, List[Tuple[str, bytes]]]]) → None
Write a WebDataset shard (tar file) containing samples.
- Parameters:
dest – Destination path for the tar shard file.
samples – List of (key, payloads) tuples where payloads is a list of (suffix, data) tuples.
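The `samples` structure maps naturally onto the WebDataset tar layout, where every member named `<key>.<suffix>` with the same key belongs to one sample. The following is a minimal standard-library sketch of that idea, not the module's implementation. It assumes suffixes are given without a leading dot, and the names `write_shard_sketch`, `morph.npy`, and `rna.npy` are hypothetical.

```python
import io
import tarfile
import tempfile
from pathlib import Path
from typing import List, Tuple


def write_shard_sketch(dest: Path, samples: List[Tuple[str, List[Tuple[str, bytes]]]]) -> None:
    """Write each (suffix, data) payload as a tar member named '<key>.<suffix>'."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    with tarfile.open(dest, "w") as tar:
        for key, payloads in samples:
            for suffix, data in payloads:
                info = tarfile.TarInfo(name=f"{key}.{suffix}")
                info.size = len(data)  # tar needs the size before the data is read
                tar.addfile(info, io.BytesIO(data))


# Round trip: write one sample with two payloads, then list the member names.
with tempfile.TemporaryDirectory() as tmp:
    shard = Path(tmp) / "train-000000.tar"
    write_shard_sketch(shard, [("ds_000000", [("morph.npy", b"\x00\x01"), ("rna.npy", b"\x02")])])
    with tarfile.open(shard) as tar:
        names = tar.getnames()
print(names)
```

Writing the payloads of one sample consecutively matters, because WebDataset readers group adjacent tar members by their shared key.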
- loki2.cl.prepare_training.build_samples(dataset_name: str, morphological: torch.Tensor, morph_positions: torch.Tensor, transcription: torch.Tensor, ordered_indices: Sequence[int], ordered_cell_ids: Sequence[str], split: str, tensor_dtype: torch.dtype) → Iterator[Tuple[str, List[Tuple[str, bytes]], Tuple[str, str, int, float, float]]]
Build WebDataset samples from paired morphological and transcription embeddings.
- Parameters:
dataset_name – Name identifier for the dataset.
morphological – Morphology embeddings tensor.
morph_positions – Position coordinates for morphology embeddings.
transcription – Transcription embeddings tensor.
ordered_indices – Ordered row indices matching the transcription embeddings.
ordered_cell_ids – Ordered cell IDs matching the transcription embeddings.
split – Dataset split identifier (‘train’ or ‘val’).
tensor_dtype – Data type for serializing tensors.
- Yields:
Tuple containing:
key – Sample key string.
payloads – List of (suffix, data) tuples for the shard.
meta_row – Metadata tuple (key, cell_id, row_index, x, y, split).
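The yielded triples can be consumed by splitting them into fixed-size groups, with (key, payloads) pairs destined for a shard and meta_row entries collected separately. The sketch below shows one possible way to do that with the standard library. The helper `batch_for_shards`, the `max_per_shard` parameter, and the fake triples are hypothetical, and how the module itself batches samples is not documented here.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple

Payloads = List[Tuple[str, bytes]]
Triple = Tuple[str, Payloads, tuple]


def batch_for_shards(
    triples: Iterable[Triple], max_per_shard: int
) -> Iterator[Tuple[List[Tuple[str, Payloads]], List[tuple]]]:
    """Group (key, payloads, meta_row) triples into per-shard sample and metadata lists."""
    iterator = iter(triples)
    while True:
        chunk = list(islice(iterator, max_per_shard))
        if not chunk:
            return
        samples = [(key, payloads) for key, payloads, _ in chunk]
        meta_rows = [meta for _, _, meta in chunk]
        yield samples, meta_rows


# Fake triples standing in for the generator's output (values are made up).
fake = [
    (f"ds_{i:06d}", [("morph.npy", b"\x00")], (f"ds_{i:06d}", f"cell_{i}", i, 0.0, 0.0, "train"))
    for i in range(5)
]
batches = list(batch_for_shards(fake, max_per_shard=2))
print([len(samples) for samples, _ in batches])
```

Each `samples` list has the shape that `write_shard` expects, while the `meta_rows` can be accumulated into a metadata table alongside the shards.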