loki2.cl.prepare_training

Convert morphology/transcription embeddings into WebDataset shards for contrastive training.

Module Contents

loki2.cl.prepare_training.DTYPE_MAP
loki2.cl.prepare_training.parse_args(argv: Iterable[str] | None = None) → argparse.Namespace

Parse command-line arguments for building WebDataset shards.

Parameters:

argv – Optional list of arguments to parse. If None, uses sys.argv.

Returns:

Parsed arguments.

Return type:

argparse.Namespace
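
The optional argv parameter follows a common argparse pattern: an explicit list can be passed in tests, while None defers to the process arguments. The sketch below illustrates that pattern only. The function name parse_args_sketch and the --shard-size option are hypothetical and are not taken from this module.

```python
import argparse
from typing import Iterable, Optional


def parse_args_sketch(argv: Optional[Iterable[str]] = None) -> argparse.Namespace:
    """Illustrative stand-in for an argv-aware parser (not the module's code)."""
    parser = argparse.ArgumentParser(description="Illustrative parser only")
    # Hypothetical option, used here purely to have something to parse.
    parser.add_argument("--shard-size", type=int, default=1000)
    # When args is None, argparse falls back to sys.argv[1:].
    return parser.parse_args(None if argv is None else list(argv))


# Passing an explicit list keeps the call independent of the real command line.
args = parse_args_sketch(["--shard-size", "256"])
print(args.shard_size)
```

Accepting argv as a parameter lets main forward its own argv, so both functions can be exercised without touching sys.argv.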

loki2.cl.prepare_training.cell_id_to_index(cell_id: str) → int

Extract the zero-based row index from the numeric suffix of a cell ID string.

Parameters:

cell_id – Cell ID string ending with a numeric index.

Returns:

Zero-based row index extracted from the cell ID.

Return type:

int

Raises:

ValueError – If no numeric suffix is found in the cell ID.
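
A minimal sketch of the documented behavior, assuming the trailing digits are used directly as the row index with no offset applied. The function name cell_id_to_index_sketch and the example ID format are hypothetical, and the real implementation may parse IDs differently.

```python
import re

# Matches one or more digits anchored at the end of the string.
_NUMERIC_SUFFIX = re.compile(r"(\d+)$")


def cell_id_to_index_sketch(cell_id: str) -> int:
    """Return the trailing integer of cell_id (assumed to be the row index)."""
    match = _NUMERIC_SUFFIX.search(cell_id)
    if match is None:
        # Mirrors the documented ValueError for IDs without a numeric suffix.
        raise ValueError(f"No numeric suffix found in cell ID: {cell_id!r}")
    return int(match.group(1))


print(cell_id_to_index_sketch("cell_42"))
```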

loki2.cl.prepare_training.tensor_to_bytes(tensor: torch.Tensor, dtype: torch.dtype) → bytes

Convert a PyTorch tensor to the target dtype and serialize it as NumPy array bytes.

Parameters:
  • tensor – PyTorch tensor to convert.

  • dtype – Target dtype for the conversion.

Returns:

Serialized NumPy array as bytes.

Return type:

bytes

loki2.cl.prepare_training.write_shard(dest: pathlib.Path, samples: List[Tuple[str, List[Tuple[str, bytes]]]]) → None

Write a WebDataset shard (tar file) containing samples.

Parameters:
  • dest – Destination path for the tar shard file.

  • samples – List of (key, payloads) tuples where payloads is a list of (suffix, data) tuples.
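
The samples structure maps onto the WebDataset convention, in which every file belonging to one sample shares the same key and is stored as key.suffix inside the tar. The sketch below shows one way to write such a shard with the standard library. The name write_shard_sketch, the example key, and the suffixes are hypothetical, and the module's own function may set additional tar metadata.

```python
import io
import tarfile
import tempfile
from pathlib import Path
from typing import List, Tuple

Sample = Tuple[str, List[Tuple[str, bytes]]]


def write_shard_sketch(dest: Path, samples: List[Sample]) -> None:
    """Write each (suffix, data) payload as a tar member named '<key>.<suffix>'."""
    with tarfile.open(dest, "w") as tar:
        for key, payloads in samples:
            for suffix, data in payloads:
                info = tarfile.TarInfo(name=f"{key}.{suffix}")
                info.size = len(data)  # the size must be set before adding the member
                tar.addfile(info, io.BytesIO(data))


# Usage with a hypothetical key and suffixes, written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    shard = Path(tmp) / "shard-000000.tar"
    write_shard_sketch(shard, [("demo_000001", [("morph.npy", b"\x00\x01"), ("rna.npy", b"\x02")])])
    with tarfile.open(shard) as tar:
        print(tar.getnames())
```

Keeping the members of one sample adjacent in the archive matters because WebDataset readers group consecutive entries by their shared key.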

loki2.cl.prepare_training.build_samples(dataset_name: str, morphological: torch.Tensor, morph_positions: torch.Tensor, transcription: torch.Tensor, ordered_indices: Sequence[int], ordered_cell_ids: Sequence[str], split: str, tensor_dtype: torch.dtype) → Iterator[Tuple[str, List[Tuple[str, bytes]], Tuple[str, str, int, float, float]]]

Build WebDataset samples from paired morphological and transcription embeddings.

Parameters:
  • dataset_name – Name identifier for the dataset.

  • morphological – Morphology embeddings tensor.

  • morph_positions – Position coordinates for morphology embeddings.

  • transcription – Transcription embeddings tensor.

  • ordered_indices – Ordered row indices matching the transcription embeddings.

  • ordered_cell_ids – Ordered cell IDs matching the transcription embeddings.

  • split – Dataset split identifier (‘train’ or ‘val’).

  • tensor_dtype – Data type for serializing tensors.

Yields:

Tuple containing:
  • key – Sample key string.

  • payloads – List of (suffix, data) tuples for the shard.

  • meta_row – Metadata tuple (key, cell_id, row_index, x, y, split).

loki2.cl.prepare_training.main(argv: Iterable[str] | None = None) → None

Main entry point for preparing WebDataset shards from embeddings.

Parameters:

argv – Optional command-line arguments. If None, uses sys.argv.