Data Preparation¶

In our tutorials of Case study 1: Gastrulation erythroid maturation, Case Study 2: Mouse hippocampus development, Case study 3: Pancreatic endocrinogenesis, and Case study 4: Human embryo glutamatergic neurogenesis, the data has already been prepared. To load your own data, the data format and the method to prepare the data is introduced below.

The input data for the prediction of velocity is in csv format. And could be loaded using pd.read_csv(). Taking the cell_type_u_s_sample_df.csv below as an example of 5 cells corresponding to 2 genes as shown below:

import pandas as pd
cell_type_u_s=pd.read_csv('your_path/cell_type_u_s_sample_df.csv')
cell_type_u_s

The gene information is represented by columns of gene_name, unsplice, and splice. The cell information is represented by columns cellID, clusters, embedding1, and embedding2. unsplice and splice columns represent the spliced and unspliced counts, separately. cellID is the unique id of each cell. clusters represent the cell type of each cell. embedding1 and embedding2 are the 2-dimensional representation of all cells such as UMAP, PCA, or t-SNE.

Preprocessing¶

Here is an overview of preparing from the scRNA sequencing data, fastq. We need our reads to be mapped to intron regions to get the counts of unspliced mRNA. If you are running 10X protocol, as 10X suggested ‘if you are running the cellranger count pipeline, you can add the –include-introns flag to the command.’

After the mapping, the two count matrices of unspliced and spliced abundances could be obtained by velocyto. We can combine the loom files of different samples with loompy . After this step, the ‘layers’ of adata will contain ‘unsplice’ and ‘splice’.

There is one more step to get the ‘spliced’ and ‘unspliced’ in our pandas DataFrame. We need the moments of the spliced and unspliced counts. Thus, we can use the pre-processing steps below to get the moments:

# !pip install scvelo --upgrade --quiet
import scvelo as scv
scv.pp.filter_and_normalize(adata, min_shared_counts=20, n_top_genes=2000)
scv.pp.moments(adata, n_pcs=30, n_neighbors=30) # cell amount will influence the setting of n_neighbors

Now we get the ‘Mu’ and ‘Ms’ in ‘layers’ of adata. ‘Mu’ and ‘Ms’ will be used in our prediction of RNA velocity.

The embedding space (e.g., t-SNE or UMAP) in ‘obsm’ of adata could be obtained using SCANPY. Identifing cell type will be helpful for the futher analysis of RNA velocity. Here shows examples of identifing cell types.

Below is an example of pre-processed adata:

AnnData object with n_obs × n_vars = 23496 × 18964
    obs: 'orig.ident', 'nCount_RNA', 'nFeature_RNA', 'donor_id', 'prob_max', 'prob_doublet', 'n_vars', 'best_singlet', 'doublet_logLikRatio', 'nCount_SCT', 'nFeature_SCT', 'pANN', 'gender', 'location', 'donors', 'SCT_snn_res.0.5', 'seurat_clusters', 'S.Score', 'G2M.Score', 'Phase', 'old.ident', 'RNA_snn_res.0.4', 'celltype', 'batch', 'initial_size_unspliced', 'initial_size_spliced', 'initial_size'
    var: 'name', 'Accession', 'Chromosome', 'End', 'Start', 'Strand'
    uns: 'cell_types_colors'
    obsm: 'X_pca', 'X_umap'
    layers: 'ambiguous', 'matrix', 'spliced', 'unspliced', 'Mu', 'Ms'

Transferring adata format to dataframe¶

We provide a function (cd.adata_to_df_with_embed()) to transfer from Anndata (a format of storing annotated data, usually are loom file) to csv format. For example, after the preprocessing in Anndata format, to transfer to pandas.DataFrame/csv format, cd.adata_to_df_with_embed() could be used. For example, in the command of:

import pandas as pd
import celldancer.utilities as cdutil
cdutil.adata_to_df_with_embed(adata,
                              us_para=['Mu','Ms'],
                              cell_type_para='celltype',
                              embed_para='X_umap',
                              save_path='cell_type_u_s.csv',
                              gene_list=['Hba-x','Smim1'])

splice and unsplice columns are obtained from the ['Ms', 'Mu'] attributes of adata.layers. Also, cellID column is obtained from adata.obs.index. clusters column is obtained from ['celltype'] of adata.obs. The embedding1 and embedding2 columns are obtained from ['X_umap'] attribute of adata.obsm.

You may also want to visualize genes of interest or have an overview of multiple genes before the prediction using the scripts shown below.

import matplotlib.pyplot as plt
import celldancer.cdplt as cdplt
import math

gene_list=list(set(cellDancer_df.gene_name))[0:10] # or specify gene(s) of interest

ncols=5
height=math.ceil(len(gene_list)/5)*4
fig = plt.figure(figsize=(20,height))

for i in range(len(gene_list)):
    ax = fig.add_subplot(math.ceil(len(gene_list)/ncols), ncols, i+1)
    cdplt.scatter_gene(
        ax=ax,
        x='splice',
        y='unsplice',
        cellDancer_df=cellDancer_df,
        custom_xlim=None,
        custom_ylim=None,
        alpha=0.5,
        s = 5,
        velocity=False,
        gene=gene_list[i])

    ax.set_title(gene_list[i])
    ax.axis('on')

plt.show()