loki2.encode_trans

Transcriptomics encoding module.

This module provides functionality to encode transcriptomics data using pre-trained models for cross-modal analysis.

Module Contents

loki2.encode_trans.encode_transcriptomics(ad_path: pathlib.Path, output_path: pathlib.Path, model_path: pathlib.Path, housekeeping_path: pathlib.Path, batch_size: int = 100, num_threads: int | None = None, device: str = 'cpu') → None

Encode transcriptomics data using a pre-trained model.

This function processes AnnData objects containing single-cell RNA-seq data, generates gene expression prompts, and encodes them using a pre-trained CLIP model for cross-modal analysis.

Parameters:

ad_path – Path to the input AnnData (.h5ad) file.
output_path – Path where encoded embeddings will be saved (.pt file).
model_path – Path to the pre-trained model checkpoint.
housekeeping_path – Path to CSV file containing housekeeping genes.
batch_size – Batch size for encoding. Defaults to 100.
num_threads – Number of threads for PyTorch operations. If None, PyTorch's default thread count is used. Defaults to None.
device – Device to use for encoding (‘cpu’ or ‘cuda’). Defaults to ‘cpu’.

Raises:

ValueError – If observation names are not unique, if duplicate or missing cell identifiers are found, or if num_threads is invalid.
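
A call using the documented signature might look like the sketch below. Every file name is a hypothetical placeholder, and the helpers `encoding_kwargs` and `run_encoding` are illustrative names, not part of loki2. The import is done lazily so the sketch can be loaded without the package installed.

```python
from pathlib import Path


def encoding_kwargs(data_dir: Path, device: str = "cpu") -> dict:
    """Collect the documented arguments; all file names here are hypothetical."""
    return {
        "ad_path": data_dir / "sample.h5ad",                       # hypothetical input AnnData file
        "output_path": data_dir / "sample_embeddings.pt",          # hypothetical output file
        "model_path": data_dir / "checkpoint.pt",                  # hypothetical checkpoint name
        "housekeeping_path": data_dir / "housekeeping_genes.csv",  # hypothetical CSV name
        "batch_size": 100,    # documented default
        "num_threads": None,  # documented default
        "device": device,     # 'cpu' or 'cuda'
    }


def run_encoding(data_dir: Path, device: str = "cpu") -> Path:
    """Run the encoder and return the path the embeddings were written to."""
    # Imported lazily so this sketch can be loaded without loki2 installed.
    from loki2.encode_trans import encode_transcriptomics

    kwargs = encoding_kwargs(data_dir, device)
    encode_transcriptomics(**kwargs)  # returns None; output goes to output_path
    return kwargs["output_path"]
```

Since the function returns None, the only result is the .pt file at output_path. The layout of that file is not described here, so inspect it before relying on a particular structure.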

loki2.encode_trans.load_model(model_path: pathlib.Path, device: str = 'cuda') → Tuple[Any, Any, Any]

Load a pre-trained CoCa model and tokenizer.

Parameters:

model_path – Path to the pre-trained model checkpoint.
device – Device to load the model on (‘cpu’ or ‘cuda’). Defaults to ‘cuda’.

Returns:

Tuple containing:

model: Loaded CoCa model
preprocess: Preprocessing function for images
tokenizer: Text tokenizer

Return type:

Tuple[Any, Any, Any]
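
Note that load_model defaults to ‘cuda’ while encode_transcriptomics defaults to ‘cpu’. On a machine that may lack a GPU, one option is to choose the device explicitly, as in this minimal sketch. `pick_device` and `load_on_available_device` are hypothetical helper names; `torch.cuda.is_available()` is the standard PyTorch check.

```python
from pathlib import Path


def pick_device() -> str:
    """Return 'cuda' when PyTorch reports a usable GPU, otherwise 'cpu'."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is nothing to run on a GPU; fall back to 'cpu'.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


def load_on_available_device(model_path: Path):
    """Load the checkpoint on whichever device pick_device() selects."""
    # Imported lazily so this sketch can be loaded without loki2 installed.
    from loki2.encode_trans import load_model

    # The documented return value is a (model, preprocess, tokenizer) tuple.
    model, preprocess, tokenizer = load_model(model_path, device=pick_device())
    return model, preprocess, tokenizer
```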

loki2.encode_trans.load_prompts_csv(csv_path: pathlib.Path) → pandas.DataFrame

Load gene prompts from a CSV file.

Parameters:

csv_path – Path to the CSV file containing gene prompts.

Returns:

DataFrame with ‘cell_id’ and ‘label’ columns.

Return type:

pd.DataFrame

Raises:

ValueError – If required columns are missing.
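
The required-column check can be illustrated with the standard library alone. This is a sketch of the documented contract (the ‘cell_id’ and ‘label’ columns, ValueError when one is missing), not the module's implementation: the real function returns a pandas DataFrame, whereas the hypothetical `check_prompt_columns` below returns a list of dicts.

```python
import csv
import io

REQUIRED_COLUMNS = ("cell_id", "label")


def check_prompt_columns(handle) -> list:
    """Read prompt rows from an open CSV handle, enforcing the required columns."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [name for name in REQUIRED_COLUMNS if name not in header]
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    # Keep only the two documented columns for each row.
    return [{name: row[name] for name in REQUIRED_COLUMNS} for row in reader]


# A well-formed file parses; extra columns are ignored by this sketch.
good = io.StringIO("cell_id,label,extra\ncell_1,GENE1 GENE2,x\n")
rows = check_prompt_columns(good)

# A file without a 'label' column is rejected.
try:
    check_prompt_columns(io.StringIO("cell_id,text\ncell_1,GENE1\n"))
    rejected = False
except ValueError:
    rejected = True
```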

loki2.encode_trans.generate_gene_df(ad: Any, house_keeping_genes: pandas.DataFrame, todense: bool = True) → pandas.DataFrame

Generate a DataFrame with the top 50 genes for each observation.

Removes genes containing ‘.’ or ‘-’ in their names, as well as genes listed in the housekeeping genes DataFrame.

Parameters:

ad – AnnData object containing gene expression data.
house_keeping_genes – DataFrame with a ‘genesymbol’ column listing housekeeping genes to exclude.
todense – Whether to convert the sparse matrix (ad.X) to a dense matrix before creating a DataFrame. Defaults to True.

Returns:

DataFrame with two columns: ‘cell_id’ and ‘label’. Each label entry is a string with the top 50 gene names (space-separated) for that observation.

Return type:

pd.DataFrame
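
The filtering and labelling described above can be sketched for a single observation in plain Python. This is an illustration only: it assumes "top" means highest expression value, which the description does not state, and it uses a dict and a set in place of the AnnData object and housekeeping DataFrame. `top_gene_label` is a hypothetical name.

```python
def top_gene_label(expression: dict, housekeeping: set, k: int = 50) -> str:
    """Build a space-separated label from the k highest-expressed remaining genes."""
    kept = {
        gene: value
        for gene, value in expression.items()
        # Drop names containing '.' or '-', and any housekeeping gene.
        if "." not in gene and "-" not in gene and gene not in housekeeping
    }
    # Assumption: rank by expression value, highest first.
    ranked = sorted(kept, key=kept.get, reverse=True)
    return " ".join(ranked[:k])


# Toy observation: ACTB is excluded as housekeeping, MT-CO1 and AC012345.1 by name.
cell = {
    "ACTB": 90.0,
    "CD3E": 12.0,
    "MT-CO1": 80.0,
    "AC012345.1": 5.0,
    "CD8A": 7.0,
    "MS4A1": 0.5,
}
label = top_gene_label(cell, housekeeping={"ACTB"}, k=2)
```

With k left at 50, the label would hold up to 50 gene names, matching the format of the ‘label’ column described above.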

loki2.encode_trans.encode_texts(model, tokenizer, texts, batch_size=256, device='cuda')
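
encode_texts carries no description here. Its signature suggests that texts are encoded in chunks of batch_size, and the sketch below shows that general batching pattern with a stand-in encoder. Whether the module works this way is an assumption; `iter_batches`, `encode_in_batches`, and `word_count_encoder` are hypothetical names.

```python
from typing import Callable, Iterator, List, Sequence


def iter_batches(texts: Sequence[str], batch_size: int = 256) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size texts."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(texts), batch_size):
        yield list(texts[start:start + batch_size])


def encode_in_batches(
    texts: Sequence[str],
    encode_batch: Callable[[List[str]], List[List[float]]],
    batch_size: int = 256,
) -> List[List[float]]:
    """Encode texts batch by batch and concatenate the results in input order."""
    embeddings: List[List[float]] = []
    for batch in iter_batches(texts, batch_size):
        embeddings.extend(encode_batch(batch))
    return embeddings


def word_count_encoder(batch: List[str]) -> List[List[float]]:
    """Stand-in for a real model: a one-element 'embedding' holding the word count."""
    return [[float(len(text.split()))] for text in batch]


prompts = ["CD3E CD8A", "MS4A1", "CD19 CD79A CD79B"]
embeddings = encode_in_batches(prompts, word_count_encoder, batch_size=2)
batch_sizes = [len(batch) for batch in iter_batches(prompts, batch_size=2)]
```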

loki2.encode_trans.build_parser() → argparse.ArgumentParser

loki2.encode_trans.main(argv: Iterable[str] = None) → None