loki2.encode_trans

Transcriptomics encoding module.

This module encodes transcriptomics data with pre-trained models for cross-modal analysis.

Module Contents

loki2.encode_trans.encode_transcriptomics(ad_path: pathlib.Path, output_path: pathlib.Path, model_path: pathlib.Path, housekeeping_path: pathlib.Path, batch_size: int = 100, num_threads: int | None = None, device: str = 'cpu') → None

Encode transcriptomics data using a pre-trained model.

This function reads an AnnData (.h5ad) file of single-cell RNA-seq data, generates gene expression prompts, and encodes them with a pre-trained CLIP model for cross-modal analysis.

Parameters:
  • ad_path – Path to the input AnnData (.h5ad) file.

  • output_path – Path where encoded embeddings will be saved (.pt file).

  • model_path – Path to the pre-trained model checkpoint.

  • housekeeping_path – Path to CSV file containing housekeeping genes.

  • batch_size – Batch size for encoding. Defaults to 100.

  • num_threads – Number of threads for PyTorch operations. If None, the PyTorch default is used. Defaults to None.

  • device – Device to use for encoding (‘cpu’ or ‘cuda’). Defaults to ‘cpu’.

Raises:

ValueError – If observation names are not unique, if duplicate or missing cell identifiers are found, or if num_threads is invalid.
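
A minimal usage sketch, assuming loki2 is installed and that the parameters can be passed by keyword as named in the signature above. The directory layout and every file name are hypothetical placeholders, and build_encode_kwargs and run_encoding are illustrative helpers, not part of the module.

```python
from pathlib import Path


def build_encode_kwargs(data_dir: Path, use_gpu: bool = False) -> dict:
    """Collect keyword arguments for encode_transcriptomics.

    All file names here are hypothetical placeholders.
    """
    return {
        "ad_path": data_dir / "sample.h5ad",                # input AnnData file
        "output_path": data_dir / "sample_embeddings.pt",   # where embeddings are saved
        "model_path": data_dir / "checkpoint.pt",           # pre-trained checkpoint
        "housekeeping_path": data_dir / "housekeeping_genes.csv",
        "batch_size": 100,       # documented default
        "num_threads": None,     # None keeps the PyTorch default
        "device": "cuda" if use_gpu else "cpu",
    }


def run_encoding(data_dir: Path, use_gpu: bool = False) -> None:
    """Run the encoder; the import is deferred so this file loads without loki2."""
    from loki2.encode_trans import encode_transcriptomics

    encode_transcriptomics(**build_encode_kwargs(data_dir, use_gpu))
```

Calling run_encoding(Path("data")) would then write the embeddings to the hypothetical data/sample_embeddings.pt, provided the input files exist.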

loki2.encode_trans.load_model(model_path: pathlib.Path, device: str = 'cuda') → Tuple[Any, Any, Any]

Load a pre-trained CoCa model and tokenizer.

Parameters:
  • model_path – Path to the pre-trained model checkpoint.

  • device – Device to load the model on (‘cpu’ or ‘cuda’). Defaults to ‘cuda’.

Returns:

Tuple containing:
  • model: Loaded CoCa model

  • preprocess: Preprocessing function for images

  • tokenizer: Text tokenizer

Return type:

Tuple[Any, Any, Any]

loki2.encode_trans.load_prompts_csv(csv_path: pathlib.Path) → pandas.DataFrame

Load gene prompts from a CSV file.

Parameters:

csv_path – Path to the CSV file containing gene prompts.

Returns:

DataFrame with ‘cell_id’ and ‘label’ columns.

Return type:

pandas.DataFrame

Raises:

ValueError – If required columns are missing.
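
The column check described above can be sketched with the standard library alone. This is an illustration of the contract ('cell_id' and 'label' must be present, otherwise ValueError), not the module's implementation: the real function returns a pandas DataFrame, while the hypothetical read_prompts helper below returns a list of dictionaries, and the sample rows are made up.

```python
import csv
import io

REQUIRED_COLUMNS = ("cell_id", "label")


def read_prompts(handle) -> list:
    """Read prompt rows and fail early if a required column is absent."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if missing:
        raise ValueError(f"Missing required columns: {missing}")
    # Keep only the two documented columns for each row.
    return [{col: row[col] for col in REQUIRED_COLUMNS} for row in reader]


# Made-up sample content standing in for a prompts CSV file.
rows = read_prompts(io.StringIO("cell_id,label\ncell_1,GENE1 GENE2\n"))
```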

loki2.encode_trans.generate_gene_df(ad: Any, house_keeping_genes: pandas.DataFrame, todense: bool = True) → pandas.DataFrame

Generate a DataFrame with the top 50 genes for each observation.

Removes genes containing ‘.’ or ‘-’ in their names, as well as genes listed in the housekeeping genes DataFrame.

Parameters:
  • ad – AnnData object containing gene expression data.

  • house_keeping_genes – DataFrame with a ‘genesymbol’ column listing housekeeping genes to exclude.

  • todense – Whether to convert the sparse matrix (ad.X) to a dense matrix before creating a DataFrame. Defaults to True.

Returns:

DataFrame with two columns: ‘cell_id’ and ‘label’.

Each label entry is a string with the top 50 gene names (space-separated) for that observation.

Return type:

pandas.DataFrame
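
The filtering and ranking described above can be illustrated in plain Python. This sketch assumes that "top" means highest expression value, which the description does not state explicitly. The top_gene_labels helper, the nested-dictionary input, and the gene names are all hypothetical; the real function works on an AnnData object and returns a DataFrame.

```python
def top_gene_labels(expression: dict, housekeeping: list, k: int = 50) -> list:
    """Build one space-separated label of top-k genes per observation.

    expression maps cell_id -> {gene_name: expression_value}.
    """
    excluded = set(housekeeping)
    rows = []
    for cell_id, genes in expression.items():
        # Drop genes with '.' or '-' in the name and any housekeeping gene.
        kept = {
            gene: value
            for gene, value in genes.items()
            if "." not in gene and "-" not in gene and gene not in excluded
        }
        # Assumption: rank by expression value, highest first.
        ranked = sorted(kept, key=lambda gene: kept[gene], reverse=True)
        rows.append({"cell_id": cell_id, "label": " ".join(ranked[:k])})
    return rows


# Made-up expression values for a single observation.
expression = {
    "cell_1": {
        "GENE1": 5.0,
        "GENE2": 9.0,
        "MT-GENE": 50.0,   # removed: contains '-'
        "GENE3.1": 7.0,    # removed: contains '.'
        "HK1": 20.0,       # removed: listed as housekeeping
        "GENE4": 1.0,
    }
}
labels = top_gene_labels(expression, housekeeping=["HK1"], k=2)
# labels == [{"cell_id": "cell_1", "label": "GENE2 GENE1"}]
```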

loki2.encode_trans.encode_texts(model, tokenizer, texts, batch_size=256, device='cuda')

loki2.encode_trans.build_parser() → argparse.ArgumentParser

loki2.encode_trans.main(argv: Iterable[str] = None) → None