loki2.immstain
==============

.. py:module:: loki2.immstain

.. autoapi-nested-parse::

   In silico immunostaining using LightGBM for HE->IF protein prediction.

   1) Split global nuclei JSON + embeddings into per-patch files.
   2) Build patch datasets with optional watershed-based cell expansion.
   3) Train a LightGBM regressor and evaluate on held-out patches.
   4) Save per-cell predictions and make simple diagnostic plots.


Module Contents
---------------

.. py:class:: TutorialConfig

   Configuration dataclass for LightGBM tutorial pipeline.

   .. attribute:: he_ome_path

      Path to H&E OME-TIFF file.

   .. attribute:: global_nuclei_json

      Path to global nuclei JSON file.

   .. attribute:: global_embedding_pt

      Path to global cell embeddings .pt file.

   .. attribute:: patch_json_dir

      Directory to save patch-specific JSON files.

   .. attribute:: patch_emb_dir

      Directory to save patch-specific embedding files.

   .. attribute:: if_patch_dir

      Directory containing IF (immunofluorescence) patch images.

   .. attribute:: if_patch_pattern

      Pattern string for IF patch filenames with {row:03d} and {col:03d} placeholders.
      Defaults to "IF_sel_r{row:03d}_c{col:03d}.tif".

   .. attribute:: level0_idx

      Index of level 0 in OME-TIFF pyramid. Defaults to 0.

   .. attribute:: level1_idx

      Index of level 1 in OME-TIFF pyramid. Defaults to 1.

   .. attribute:: patch_size

      Size of patches in pixels. Defaults to 2048.


   .. py:attribute:: he_ome_path
      :type: pathlib.Path


   .. py:attribute:: global_nuclei_json
      :type: pathlib.Path


   .. py:attribute:: global_embedding_pt
      :type: pathlib.Path


   .. py:attribute:: patch_json_dir
      :type: pathlib.Path


   .. py:attribute:: patch_emb_dir
      :type: pathlib.Path


   .. py:attribute:: if_patch_dir
      :type: pathlib.Path


   .. py:attribute:: if_patch_pattern
      :type: str
      :value: 'IF_sel_r{row:03d}_c{col:03d}.tif'


   .. py:attribute:: level0_idx
      :type: int
      :value: 0


   .. py:attribute:: level1_idx
      :type: int
      :value: 1


   .. py:attribute:: patch_size
      :type: int
      :value: 2048


.. py:function:: timed_section(label: str) -> Generator[None, None, None]

   Context manager to time a code section.

   :param label: Label string to display with the timing output.

   :Yields: *None* -- Context manager yields control to the code block.

   .. rubric:: Example

   >>> with timed_section("Processing data"):
   ...     process_data()
   [timer] Processing data: 1.23s


.. py:function:: parse_patch_index(name: str) -> Tuple[int, int]

   Parse row and column indices from patch filename.

   Extracts row and column indices from filenames matching pattern "r{row}_c{col}".

   :param name: Filename or path containing patch indices.

   :returns: A tuple containing (row, col) indices.
   :rtype: Tuple[int, int]

   :raises ValueError: If the pattern cannot be found in the filename.


.. py:function:: normalize_patch_name(name: str) -> str

   Normalize patch name to standard format.

   Converts a patch filename to a normalized format "r{row:03d}_c{col:03d}".

   :param name: Filename or path containing patch indices.

   :returns: Normalized patch name string (e.g., "r001_c002").
   :rtype: str

   :raises ValueError: If patch indices cannot be parsed from the name.


.. py:function:: compute_ssim_1d(y_true: numpy.ndarray, y_pred: numpy.ndarray) -> float

   Compute 1D Structural Similarity Index (SSIM) between true and predicted values.

   Computes SSIM by reshaping 1D arrays into 2D and using structural similarity metric.
   Returns NaN if arrays are too small or have zero range.

   :param y_true: True values as 1D numpy array.
   :param y_pred: Predicted values as 1D numpy array.

   :returns: SSIM value between -1 and 1, or NaN if computation is not possible.
   :rtype: float


.. py:function:: compute_regression_metrics(y_true: numpy.ndarray, y_pred: numpy.ndarray) -> Dict[str, float]

   Compute regression metrics for model evaluation.

   Calculates multiple regression metrics including MSE, MAE, R², Pearson correlation, and SSIM.

   :param y_true: True target values as numpy array.
   :param y_pred: Predicted values as numpy array.
   :returns: Dictionary containing:

      - mse: Mean squared error.
      - mae: Mean absolute error.
      - r2: R² score.
      - pearson_r: Pearson correlation coefficient.
      - ssim: Structural similarity index.

   :rtype: Dict[str, float]


.. py:function:: setup_dummy_cellvit_modules() -> None

   Set up dummy modules for 'cellvit' and its submodules.

   This is required to simulate the module structure that the torch model expects
   when loading .pt files that reference cellvit classes. Creates dummy modules in
   sys.modules to handle unpickling of objects that reference
   cellvit.data.dataclass.cell_graph classes.


.. py:function:: load_global_embeddings(pt_path: pathlib.Path) -> numpy.ndarray

   Load global cell embeddings from a .pt file.

   Sets up dummy cellvit modules and loads embeddings from a PyTorch .pt file.
   The loaded object must have an 'x' attribute containing the embeddings.

   :param pt_path: Path to the .pt file containing embeddings.

   :returns: Cell embeddings as a numpy array of shape (n_cells, emb_dim).
   :rtype: np.ndarray

   :raises RuntimeError: If the loaded object does not have an 'x' attribute.


.. py:function:: open_level_chw(path: pathlib.Path, level: int, n_channels: int = 3) -> Tuple[zarr.Array, Any]

   Open a specific pyramid level from an OME-TIFF file as CHW format.

   Opens the specified pyramid level from an OME-TIFF file and returns it as a
   zarr array in channel-first format (C, H, W).

   :param path: Path to the OME-TIFF file.
   :param level: Pyramid level index to open.
   :param n_channels: Expected number of channels. Defaults to 3.

   :returns: A tuple containing:

      - arr: Zarr array in CHW format (channels, height, width).
      - tf: TiffFile object (should be closed after use).

   :rtype: Tuple[zarr.Array, Any]

   :raises RuntimeError: If no zarr.Array is found or the array shape is unsupported.


.. py:function:: compute_level_scale(cfg: TutorialConfig) -> Tuple[int, int, int, int, float, float, int, int]

   Compute scale factors between pyramid levels and patch grid dimensions.
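
   A minimal sketch of the arithmetic involved, assuming the scale factors are the level 1 to level 0 size ratios (H1 / H0 and W1 / W0) and that the patch grid comes from ceiling division of the level 1 size by ``patch_size``. The helper name ``level_scale_sketch`` and the ceiling rule are illustrative assumptions, not the module's confirmed implementation.

   .. code-block:: python

      import math
      from typing import Tuple


      def level_scale_sketch(
          h0: int, w0: int, h1: int, w1: int, patch_size: int = 2048
      ) -> Tuple[float, float, int, int]:
          """Return (scale_y, scale_x, n_rows, n_cols) for two pyramid levels."""
          # Scale factors map level 0 coordinates onto level 1.
          scale_y = h1 / h0
          scale_x = w1 / w0
          # Assumed rule: enough patches to cover level 1, so the last
          # row/column may be only partially filled.
          n_rows = math.ceil(h1 / patch_size)
          n_cols = math.ceil(w1 / patch_size)
          return scale_y, scale_x, n_rows, n_cols


      # Hypothetical slide: level 1 is half the size of level 0.
      scale_y, scale_x, n_rows, n_cols = level_scale_sketch(40000, 60000, 20000, 30000)

   Under these assumptions, a nucleus centroid at level 0 coordinates ``(x0, y0)`` would fall in patch row ``int(y0 * scale_y) // patch_size`` and column ``int(x0 * scale_x) // patch_size``.
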

   Opens level0 and level1 from the OME-TIFF, computes scale factors, and
   calculates the number of patches needed to cover level1.

   :param cfg: TutorialConfig instance with OME-TIFF path and patch settings.

   :returns: A tuple containing:

      - H0: Height of level 0.
      - W0: Width of level 0.
      - H1: Height of level 1.
      - W1: Width of level 1.
      - scale_y: Vertical scale factor (H1 / H0).
      - scale_x: Horizontal scale factor (W1 / W0).
      - n_rows: Number of patch rows needed.
      - n_cols: Number of patch columns needed.

   :rtype: Tuple[int, int, int, int, float, float, int, int]


.. py:function:: generate_patch_json_and_embeddings(cfg: TutorialConfig) -> None

   Generate patch-specific JSON and embedding files from global data.

   Splits global nuclei JSON and embeddings into per-patch files based on the
   patch grid computed from level1. Each patch gets a JSON file with cell
   contours and an embedding .pt file.

   :param cfg: TutorialConfig instance with paths and settings.

   :raises RuntimeError: If there's a mismatch between JSON cells and embeddings.


.. py:function:: list_available_patches(cfg: TutorialConfig) -> List[str]

   List patch names that have both JSON and IF image files.

   Finds patches that have corresponding files in both patch_json_dir and
   if_patch_dir, matching the IF patch pattern.

   :param cfg: TutorialConfig instance with directory paths and IF pattern.

   :returns: List of normalized patch names (e.g., ["r001_c002", ...]).
   :rtype: List[str]


.. py:function:: expand_nuclei_watershed_multichannel(nuclear_polys_list: List[numpy.ndarray], image_shape: Tuple[int, Ellipsis], if_img: Optional[numpy.ndarray] = None, guide_channel: Optional[int] = None, expansion_distance: int = 20, extend_nuclei: bool = True) -> numpy.ndarray

   Expand nuclear polygons using watershed segmentation.

   Creates a cell mask by expanding nuclear contours using watershed segmentation.
   Can use IF image intensity as a guide for expansion.

   :param nuclear_polys_list: List of nuclear polygon contours as numpy arrays.
   :param image_shape: Shape of the image (height, width) or (height, width, channels).
   :param if_img: Optional IF image to guide expansion. If provided, uses intensity to guide watershed. Defaults to None.
   :param guide_channel: Optional channel index to use as guide. If None and if_img is provided, uses mean across channels. Defaults to None.
   :param expansion_distance: Maximum distance for expansion in pixels. Defaults to 20.
   :param extend_nuclei: Whether to extend nuclei using watershed. If False, returns only the nuclear mask. Defaults to True.

   :returns: Cell mask as uint16 array with cell IDs as pixel values.
   :rtype: np.ndarray


.. py:function:: calculate_protein_expression_single_channel(cell_mask: numpy.ndarray, if_img: numpy.ndarray, nuclear_polys_list: List[numpy.ndarray], channel_idx: int) -> Tuple[numpy.ndarray, List[Dict[str, Union[int, float]]]]

   Calculate protein expression per cell for a single IF channel.

   Computes mean and total expression values for each cell in the cell mask
   from the specified IF channel.

   :param cell_mask: Cell mask array with cell IDs as pixel values.
   :param if_img: IF image as HxWxC numpy array.
   :param nuclear_polys_list: List of nuclear polygon contours.
   :param channel_idx: Index of the IF channel to analyze.

   :returns: A tuple containing:

      - expr_mean: Array of mean expression values per cell.
      - cell_info: List of dictionaries with cell information:

        - cell_id: Cell ID in the mask.
        - channel_idx: Channel index.
        - mean: Mean expression value.
        - total: Total expression value.
        - cell_area: Cell area in pixels.
        - nuclear_area: Nuclear area in pixels.

   :rtype: Tuple[np.ndarray, List[Dict[str, Union[int, float]]]]

   :raises ValueError: If if_img is not 3D or channel_idx is out of range.


.. py:function:: load_predictions_for_patch(prediction_json_path: pathlib.Path, patch_base: str, channel_idx: int) -> Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray]

   Load per-cell predictions for a specific patch and channel.
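
   A rough sketch of the kind of filtering this function performs, assuming a hypothetical JSON layout in which a top-level ``predictions`` list holds one record per cell with ``patch``, ``local_cell_idx``, ``channel_idx``, ``y_true`` and ``y_pred`` keys. The layout, the key names and the helper ``filter_predictions_sketch`` are assumptions for illustration; the file actually written by ``save_prediction_json`` may be organized differently, and the real function returns numpy arrays rather than lists.

   .. code-block:: python

      from typing import Any, Dict, List, Tuple


      def filter_predictions_sketch(
          payload: Dict[str, Any], patch_base: str, channel_idx: int
      ) -> Tuple[List[int], List[float], List[float]]:
          """Select the records of one patch and channel (hypothetical layout)."""
          rows = [
              rec
              for rec in payload["predictions"]  # assumed top-level key
              if rec["patch"] == patch_base and rec["channel_idx"] == channel_idx
          ]
          if not rows:
              # An empty selection is treated as an error, as documented.
              raise RuntimeError(
                  f"No prediction records for patch {patch_base}, channel {channel_idx}"
              )
          local_idxs = [int(rec["local_cell_idx"]) for rec in rows]
          y_true = [float(rec["y_true"]) for rec in rows]
          y_pred = [float(rec["y_pred"]) for rec in rows]
          return local_idxs, y_true, y_pred


      # Toy payload with made-up values, only to exercise the filter.
      payload = {
          "predictions": [
              {"patch": "r001_c002", "local_cell_idx": 0, "channel_idx": 1, "y_true": 0.5, "y_pred": 0.4},
              {"patch": "r001_c002", "local_cell_idx": 1, "channel_idx": 2, "y_true": 1.5, "y_pred": 1.1},
              {"patch": "r003_c004", "local_cell_idx": 0, "channel_idx": 1, "y_true": 2.0, "y_pred": 1.8},
          ]
      }
      idxs, y_true, y_pred = filter_predictions_sketch(payload, "r001_c002", 1)
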

   :param prediction_json_path: Path to the prediction JSON file.
   :param patch_base: Patch name in format "r{row:03d}_c{col:03d}".
   :param channel_idx: Channel index to filter predictions.

   :returns: A tuple containing:

      - local_idxs: Array of local cell indices.
      - y_true: Array of true values.
      - y_pred: Array of predicted values.

   :rtype: Tuple[np.ndarray, np.ndarray, np.ndarray]

   :raises FileNotFoundError: If the prediction JSON file does not exist.
   :raises RuntimeError: If no prediction records are found for the patch and channel.


.. py:function:: load_patch_json_and_embeddings(cfg: TutorialConfig, row: int, col: int) -> Tuple[List[numpy.ndarray], numpy.ndarray, pathlib.Path, pathlib.Path]

   Load patch JSON and embeddings for a specific patch.

   :param cfg: TutorialConfig instance with directory paths.
   :param row: Patch row index.
   :param col: Patch column index.

   :returns: A tuple containing:

      - nuclear_polys_list: List of nuclear polygon contours.
      - emb: Cell embeddings as numpy array.
      - json_path: Path to the loaded JSON file.
      - emb_path: Path to the loaded embedding file.

   :rtype: Tuple[List[np.ndarray], np.ndarray, Path, Path]

   :raises FileNotFoundError: If patch JSON or embedding file does not exist.
   :raises RuntimeError: If embedding format is unsupported.


.. py:function:: load_patch_if_image(cfg: TutorialConfig, row: int, col: int) -> Tuple[numpy.ndarray, pathlib.Path]

   Load IF patch image for a specific patch.

   :param cfg: TutorialConfig instance with IF patch directory and pattern.
   :param row: Patch row index.
   :param col: Patch column index.

   :returns: A tuple containing:

      - if_img: IF image as numpy array (H, W, C). If input is 2D, adds a channel dimension.
      - path: Path to the loaded IF image file.

   :rtype: Tuple[np.ndarray, Path]

   :raises FileNotFoundError: If the IF patch file does not exist.


.. py:function:: build_dataset_for_patches(cfg: TutorialConfig, patch_names: Iterable[str], channel_idx: int, guide_channel: Optional[int] = None, expansion_distance: int = 20, extend_nuclei: bool = True, is_enrich: bool = False, neighbor_k: int = 5) -> Tuple[numpy.ndarray, numpy.ndarray, List[Dict[str, Union[int, str]]]]

   Build training dataset from multiple patches.

   Processes patches to extract embeddings and protein expression values,
   optionally enriching embeddings with spatial and morphological features.

   :param cfg: TutorialConfig instance with paths and settings.
   :param patch_names: Iterable of patch names to process.
   :param channel_idx: IF channel index to extract expression from.
   :param guide_channel: Optional channel index for watershed expansion guide. Defaults to None.
   :param expansion_distance: Maximum distance for watershed expansion. Defaults to 20.
   :param extend_nuclei: Whether to extend nuclei using watershed. Defaults to True.
   :param is_enrich: Whether to enrich embeddings with additional features. Defaults to False.
   :param neighbor_k: Number of neighbors for enrichment features. Defaults to 5.

   :returns: A tuple containing:

      - X: Feature matrix of shape (n_cells, n_features).
      - y: Target values array of shape (n_cells,).
      - meta: List of metadata dictionaries with keys:

        - patch: Patch name.
        - local_cell_idx: Local cell index within patch.
        - channel_idx: Channel index.

   :rtype: Tuple[np.ndarray, np.ndarray, List[Dict[str, Union[int, str]]]]

   :raises RuntimeError: If patch_names is empty or no valid patches are found.


.. py:function:: build_enrichment_features(nuclear_polys_list: List[numpy.ndarray], cell_info: List[Dict[str, Any]], embeddings: numpy.ndarray, k_neighbors: int = 5) -> numpy.ndarray

   Build enrichment features from spatial and morphological information.

   Creates additional features including nuclear area, cell area, mean distance
   to neighbors, and mean neighbor embeddings.

   :param nuclear_polys_list: List of nuclear polygon contours.
   :param cell_info: List of cell information dictionaries with keys: nuclear_area, cell_area.
   :param embeddings: Cell embeddings array of shape (n_cells, emb_dim).
   :param k_neighbors: Number of nearest neighbors to consider. Defaults to 5.

   :returns: Enrichment features array of shape (n_cells, n_enrich_features).
   :rtype: np.ndarray


.. py:function:: compute_neighbor_statistics(centroids: numpy.ndarray, embeddings: numpy.ndarray, k: int = 5) -> Tuple[numpy.ndarray, numpy.ndarray]

   Compute neighbor statistics for enrichment features.

   Calculates mean distance to k nearest neighbors and mean embeddings of
   those neighbors for each cell.

   :param centroids: Cell centroids array of shape (n_cells, 2).
   :param embeddings: Cell embeddings array of shape (n_cells, emb_dim).
   :param k: Number of nearest neighbors to consider. Defaults to 5.

   :returns: A tuple containing:

      - mean_dist: Mean distance to k nearest neighbors per cell.
      - mean_emb: Mean embeddings of k nearest neighbors per cell.

   :rtype: Tuple[np.ndarray, np.ndarray]


.. py:function:: l2_normalize_embeddings(X: numpy.ndarray) -> numpy.ndarray

   L2-normalize embeddings row-wise.

   Normalizes each row (embedding vector) to unit length. Rows with zero norm
   are set to have norm 1.0 to avoid division by zero.

   :param X: Embedding matrix of shape (n_samples, n_features).

   :returns: L2-normalized embedding matrix with same shape as input.
   :rtype: np.ndarray


.. py:function:: train_lightgbm_regressor(X_train: numpy.ndarray, y_train: numpy.ndarray, X_val: numpy.ndarray, y_val: numpy.ndarray, random_state: int = 42, num_boost_round: int = 2000, early_stopping_rounds: int = 100) -> Tuple[Any, Dict[str, float], numpy.ndarray]

   Train a LightGBM regressor model.

   Trains a LightGBM regression model with early stopping and returns the
   trained model, validation metrics, and predictions.

   :param X_train: Training feature matrix of shape (n_train, n_features).
   :param y_train: Training target values of shape (n_train,).
   :param X_val: Validation feature matrix of shape (n_val, n_features).
   :param y_val: Validation target values of shape (n_val,).
   :param random_state: Random seed for reproducibility. Defaults to 42.
   :param num_boost_round: Maximum number of boosting rounds. Defaults to 2000.
   :param early_stopping_rounds: Number of rounds to wait for improvement before stopping. Defaults to 100.

   :returns: A tuple containing:

      - model: Trained LightGBM model.
      - metrics: Dictionary of validation metrics (mse, mae, r2, etc.).
      - y_val_pred: Validation predictions array.

   :rtype: Tuple[Any, Dict[str, float], np.ndarray]


.. py:function:: predict_with_model(model: Any, X: numpy.ndarray) -> numpy.ndarray

   Make predictions using a trained LightGBM model.

   Uses the best iteration if available, otherwise uses all iterations.

   :param model: Trained LightGBM model.
   :param X: Feature matrix of shape (n_samples, n_features).

   :returns: Predictions array of shape (n_samples,).
   :rtype: np.ndarray


.. py:function:: save_prediction_json(path: pathlib.Path, channel_idx: int, extend_nuclei: bool, normalize_embeddings: bool, metrics: Dict[str, Any], meta: List[Dict[str, Any]], y_true: numpy.ndarray, y_pred: numpy.ndarray) -> None

   Save per-cell predictions to a JSON file.

   Saves predictions along with metadata and metrics to a JSON file for later
   analysis and visualization.

   :param path: Path to save the JSON file.
   :param channel_idx: IF channel index.
   :param extend_nuclei: Whether nuclei were extended using watershed.
   :param normalize_embeddings: Whether embeddings were normalized.
   :param metrics: Dictionary of evaluation metrics.
   :param meta: List of metadata dictionaries with patch and cell information.
   :param y_true: True target values array.
   :param y_pred: Predicted values array.


.. py:function:: plot_pred_vs_true(y_true: numpy.ndarray, y_pred: numpy.ndarray, title: str = 'Pred vs True') -> Tuple[matplotlib.pyplot.Figure, matplotlib.pyplot.Axes]

   Plot predicted vs true values scatter plot.
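
   A minimal sketch of such a plot, assuming Matplotlib with a non-interactive backend and a diagonal spanning the joint range of both arrays. The helper name ``pred_vs_true_sketch``, the marker styling and the axis labels are illustrative choices, not the module's confirmed implementation.

   .. code-block:: python

      import matplotlib

      matplotlib.use("Agg")  # headless backend so the sketch runs without a display
      import matplotlib.pyplot as plt
      import numpy as np


      def pred_vs_true_sketch(y_true, y_pred, title="Pred vs True"):
          """Scatter predicted against true values with a y = x reference line."""
          y_true = np.asarray(y_true, dtype=float)
          y_pred = np.asarray(y_pred, dtype=float)
          fig, ax = plt.subplots(figsize=(4, 4))
          ax.scatter(y_true, y_pred, s=4, alpha=0.5)
          # The diagonal covers the joint range, so perfect predictions lie on it.
          lo = float(min(y_true.min(), y_pred.min()))
          hi = float(max(y_true.max(), y_pred.max()))
          ax.plot([lo, hi], [lo, hi], linestyle="--", color="black")
          ax.set_xlabel("True")
          ax.set_ylabel("Predicted")
          ax.set_title(title)
          return fig, ax


      # Synthetic values, only to exercise the plotting code.
      fig, ax = pred_vs_true_sketch([0.0, 1.0, 2.0, 3.0], [0.1, 0.9, 2.2, 2.7])

   Points above the dashed line are over-predictions and points below it are under-predictions.
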

   Creates a scatter plot with a diagonal reference line for regression evaluation.

   :param y_true: True target values array.
   :param y_pred: Predicted values array.
   :param title: Plot title. Defaults to "Pred vs True".

   :returns: Matplotlib figure and axes objects.
   :rtype: Tuple[plt.Figure, plt.Axes]


.. py:function:: plot_residual_hist(y_true: numpy.ndarray, y_pred: numpy.ndarray, title: str = 'Residuals') -> Tuple[matplotlib.pyplot.Figure, matplotlib.pyplot.Axes]

   Plot histogram of prediction residuals.

   Creates a histogram of residuals (predicted - true) for regression evaluation.

   :param y_true: True target values array.
   :param y_pred: Predicted values array.
   :param title: Plot title. Defaults to "Residuals".

   :returns: Matplotlib figure and axes objects.
   :rtype: Tuple[plt.Figure, plt.Axes]


.. py:function:: visualize_patch_prediction(cfg: TutorialConfig, patch_name: str, prediction_json_path: pathlib.Path, channel_idx: int, guide_channel: Optional[int] = None, expansion_distance: int = 20, extend_nuclei: bool = True, predictions_are_log: bool = True, colorbar_range: Optional[Tuple[float, float]] = None, save_path: Optional[pathlib.Path] = None, show_boundaries: bool = False) -> Tuple[matplotlib.pyplot.Figure, numpy.ndarray]

   Plot measured vs predicted expression per cell on one patch.

   Produces a 1x2 figure showing:

   - IF channel with measured expression filled per cell
   - IF channel with predicted expression filled per cell

   :param cfg: TutorialConfig instance with paths and settings.
   :param patch_name: Patch name in format "r{row:03d}_c{col:03d}".
   :param prediction_json_path: Path to prediction JSON file.
   :param channel_idx: IF channel index to visualize.
   :param guide_channel: Optional channel index for watershed expansion guide. Defaults to None.
   :param expansion_distance: Maximum distance for watershed expansion. Defaults to 20.
   :param extend_nuclei: Whether nuclei were extended using watershed. Defaults to True.
   :param predictions_are_log: Whether predictions are stored as log1p values. If True, converts to raw scale. Defaults to True.
   :param colorbar_range: Optional tuple (vmin, vmax) for colorbar range. If None, uses 2nd and 98th percentiles. Defaults to None.
   :param save_path: Optional path to save the figure. Defaults to None.
   :param show_boundaries: Whether to show cell boundaries. Defaults to False.

   :returns: Matplotlib figure and axes array.
   :rtype: Tuple[plt.Figure, np.ndarray]

   :raises ValueError: If channel_idx is out of range for the IF image.
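
   A small sketch of the value handling described for ``predictions_are_log`` and ``colorbar_range``. It assumes that both the measured and the predicted values are stored on the log1p scale, and that the default range is taken from the 2nd and 98th percentiles of the pooled measured and predicted values. Both points, and the helper name ``display_values_sketch``, are assumptions rather than the module's confirmed behavior.

   .. code-block:: python

      import numpy as np


      def display_values_sketch(y_true, y_pred, predictions_are_log=True, colorbar_range=None):
          """Return raw-scale values and a (vmin, vmax) pair for a shared colorbar."""
          y_true = np.asarray(y_true, dtype=float)
          y_pred = np.asarray(y_pred, dtype=float)
          if predictions_are_log:
              # expm1 inverts log1p: expm1(log1p(x)) == x up to rounding.
              y_true = np.expm1(y_true)
              y_pred = np.expm1(y_pred)
          if colorbar_range is None:
              # Assumed default: robust limits over both panels together,
              # so a few very bright cells do not wash out the color scale.
              pooled = np.concatenate([y_true, y_pred])
              vmin, vmax = np.percentile(pooled, [2, 98])
          else:
              vmin, vmax = colorbar_range
          return y_true, y_pred, float(vmin), float(vmax)


      # Synthetic log1p values, only to exercise the conversion.
      raw_true, raw_pred, vmin, vmax = display_values_sketch(
          np.log1p([0.0, 1.0, 2.0, 3.0]), np.log1p([0.0, 1.0, 2.0, 4.0])
      )

   Sharing one ``(vmin, vmax)`` pair between the measured and predicted panels keeps their colors directly comparable.
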