loki2.encode_trans
Transcriptomics encoding module.
This module encodes transcriptomics data with pre-trained models for cross-modal analysis.
Module Contents
- loki2.encode_trans.encode_transcriptomics(ad_path: pathlib.Path, output_path: pathlib.Path, model_path: pathlib.Path, housekeeping_path: pathlib.Path, batch_size: int = 100, num_threads: int | None = None, device: str = 'cpu') → None
Encode transcriptomics data using a pre-trained model.
This function processes AnnData objects containing single-cell RNA-seq data, generates gene expression prompts, and encodes them using a pre-trained CLIP model for cross-modal analysis.
- Parameters:
ad_path – Path to the input AnnData (.h5ad) file.
output_path – Path where encoded embeddings will be saved (.pt file).
model_path – Path to the pre-trained model checkpoint.
housekeeping_path – Path to CSV file containing housekeeping genes.
batch_size – Batch size for encoding. Defaults to 100.
num_threads – Number of threads for PyTorch operations. If None, the PyTorch default is used. Defaults to None.
device – Device to use for encoding (‘cpu’ or ‘cuda’). Defaults to ‘cpu’.
- Raises:
ValueError – If observation names are not unique, if duplicate or missing cell identifiers are found, or if num_threads is invalid.
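The documented ValueError cases can be illustrated with a small pre-flight check. This is a minimal sketch, not the module's code: `check_inputs` is a hypothetical helper, and the rule that an invalid `num_threads` means "not a positive integer" is an assumption, since the reference does not define it.

```python
from typing import Optional, Sequence


def check_inputs(obs_names: Sequence[str], num_threads: Optional[int] = None) -> None:
    """Hypothetical checks mirroring the documented ValueError conditions."""
    # Observation names must be unique so each embedding maps to exactly one cell.
    seen = set()
    duplicates = set()
    for name in obs_names:
        if name in seen:
            duplicates.add(name)
        seen.add(name)
    if duplicates:
        raise ValueError(f"Observation names are not unique: {sorted(duplicates)}")

    # Assumption: "invalid" means anything other than None or a positive integer.
    if num_threads is not None and (not isinstance(num_threads, int) or num_threads < 1):
        raise ValueError(f"num_threads must be a positive integer, got {num_threads!r}")
```

Calling `check_inputs(["cell_1", "cell_2"], num_threads=4)` returns silently, while a repeated name or `num_threads=0` raises ValueError.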
- loki2.encode_trans.load_model(model_path: pathlib.Path, device: str = 'cuda') → Tuple[Any, Any, Any]
Load a pre-trained CoCa model and tokenizer.
- Parameters:
model_path – Path to the pre-trained model checkpoint.
device – Device to load the model on (‘cpu’ or ‘cuda’). Defaults to ‘cuda’.
- Returns:
- Tuple containing:
model: Loaded CoCa model
preprocess: Preprocessing function for images
tokenizer: Text tokenizer
- Return type:
Tuple[Any, Any, Any]
- loki2.encode_trans.load_prompts_csv(csv_path: pathlib.Path) → pandas.DataFrame
Load gene prompts from a CSV file.
- Parameters:
csv_path – Path to the CSV file containing gene prompts.
- Returns:
DataFrame with ‘cell_id’ and ‘label’ columns.
- Return type:
pandas.DataFrame
- Raises:
ValueError – If required columns are missing.
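The required-column check described above can be sketched as follows. The documented function returns a pandas DataFrame; this stand-in uses only the standard library `csv` module to stay self-contained. `read_prompts` and `REQUIRED_COLUMNS` are hypothetical names, and treating ‘cell_id’ and ‘label’ as the required columns is an assumption drawn from the documented return value.

```python
import csv
import io
from typing import Dict, List

# Assumption: the required columns are the two named in the return description.
REQUIRED_COLUMNS = ("cell_id", "label")


def read_prompts(handle) -> List[Dict[str, str]]:
    """Read prompt rows from an open CSV handle, validating the header first."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if missing:
        raise ValueError(f"Missing required columns: {missing}")
    # Keep only the required columns, in a fixed order.
    return [{col: row[col] for col in REQUIRED_COLUMNS} for row in reader]


# Made-up sample data for illustration.
sample = io.StringIO("cell_id,label\ncell_1,GENE1 GENE2\ncell_2,GENE3 GENE4\n")
rows = read_prompts(sample)
```

Validating the header before reading any rows means a malformed file fails immediately, rather than partway through encoding.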
- loki2.encode_trans.generate_gene_df(ad: Any, house_keeping_genes: pandas.DataFrame, todense: bool = True) → pandas.DataFrame
Generate a DataFrame with the top 50 genes for each observation.
Removes genes containing ‘.’ or ‘-’ in their names, as well as genes listed in the housekeeping genes DataFrame.
- Parameters:
ad – AnnData object containing gene expression data.
house_keeping_genes – DataFrame with a ‘genesymbol’ column listing housekeeping genes to exclude.
todense – Whether to convert the sparse matrix (ad.X) to a dense matrix before creating a DataFrame. Defaults to True.
- Returns:
- DataFrame with two columns: ‘cell_id’ and ‘label’.
Each label entry is a string with the top 50 gene names (space-separated) for that observation.
- Return type:
pandas.DataFrame
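The filtering and ranking steps described above can be sketched in plain Python. This is an illustration under stated assumptions, not the module's implementation: `top_gene_labels` is a hypothetical helper, the input is a nested dict rather than an AnnData object, and "top" is assumed to mean highest expression value. The gene names are made up.

```python
from typing import Dict, Iterable, List


def top_gene_labels(
    expression: Dict[str, Dict[str, float]],
    housekeeping: Iterable[str],
    top_n: int = 50,
) -> List[Dict[str, str]]:
    """Build one space-separated label of top genes per observation."""
    excluded = set(housekeeping)
    rows = []
    for cell_id, genes in expression.items():
        # Drop genes whose names contain '.' or '-', and housekeeping genes.
        kept = {
            gene: value
            for gene, value in genes.items()
            if "." not in gene and "-" not in gene and gene not in excluded
        }
        # Assumption: rank by expression, highest first; ties keep input order.
        ranked = sorted(kept, key=kept.get, reverse=True)[:top_n]
        rows.append({"cell_id": cell_id, "label": " ".join(ranked)})
    return rows


# Made-up expression values for a single observation.
expression = {
    "cell_1": {
        "GENE_A": 5.0,
        "GENE_B": 9.0,
        "HK1": 20.0,       # listed as housekeeping below, so excluded
        "MT-X": 30.0,      # contains '-', so excluded
        "AC0001.1": 7.0,   # contains '.', so excluded
        "GENE_C": 1.0,
    }
}
labels = top_gene_labels(expression, housekeeping=["HK1"], top_n=2)
```

With `top_n=2`, the single row's label is `"GENE_B GENE_A"`: the three excluded genes never enter the ranking, even though they have the highest values.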
- loki2.encode_trans.encode_texts(model, tokenizer, texts, batch_size=256, device='cuda')
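This function is undocumented, but its `batch_size` parameter suggests that texts are encoded in fixed-size chunks. The general pattern can be sketched as below. `iter_batches`, `encode_in_batches`, and `fake_encoder` are hypothetical names, and the stand-in encoder replaces the real model and tokenizer, whose behavior is not described here.

```python
from typing import Callable, Iterator, List, Sequence


def iter_batches(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def encode_in_batches(
    texts: Sequence[str],
    encode_batch: Callable[[Sequence[str]], List[List[float]]],
    batch_size: int = 256,
) -> List[List[float]]:
    """Encode texts batch by batch, preserving the input order."""
    embeddings: List[List[float]] = []
    for batch in iter_batches(texts, batch_size):
        embeddings.extend(encode_batch(batch))
    return embeddings


def fake_encoder(batch: Sequence[str]) -> List[List[float]]:
    # Stand-in for a real model: a one-element "embedding" holding the text length.
    return [[float(len(text))] for text in batch]


vectors = encode_in_batches(["a", "bb", "ccc"], fake_encoder, batch_size=2)
```

Batching bounds peak memory use, which matters most when encoding on a GPU; the final batch may be smaller than `batch_size`.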
- loki2.encode_trans.build_parser() → argparse.ArgumentParser