Data Preparation

In our tutorials (Resolve the transcriptional boost in gastrulation erythroid maturation data, Analyze RNA velocity for mouse hippocampal dentate gyrus neurogenesis data, and Decode cell sub-population based on kinetic parameters of pancreatic endocrinogenesis data), the data has already been prepared. To help you load your own data, this section introduces the data format and how to prepare it.

The input data for velocity prediction is in csv format and can be loaded using pandas.read_csv(). Take cell_type_u_s_sample_df.csv below as an example, which contains 5 cells and 2 genes. The gene information is given by the columns gene_name, unsplice, and splice. The cell information is given by the columns cellID, clusters, embedding1, and embedding2. The unsplice and splice columns hold the unspliced and spliced counts, respectively. cellID is the unique id of each cell. clusters is the cell type of each cell. embedding1 and embedding2 are the coordinates of each cell in a 2-dimensional representation such as UMAP, PCA, or t-SNE.

[1]:
import pandas as pd
cell_type_u_s=pd.read_csv('your_path/cell_type_u_s_sample_df.csv')
cell_type_u_s
[1]:
   gene_name  unsplice      splice       cellID             clusters  embedding1  embedding2
0      Hba-x  0.000000    0.123217     cell_363  Blood progenitors 2    3.460521   15.574629
1      Hba-x  0.000000    0.008806     cell_385  Blood progenitors 2    2.351203   15.267069
2      Hba-x  0.023665   21.719713     cell_592           Erythroid1    6.170377   12.916482
3      Hba-x  0.447068  301.915400   cell_16475           Erythroid2    8.311832    9.724998
4      Hba-x  0.665660  637.665650  cell_139318           Erythroid3    8.032358    7.603037
5      Sulf2  0.000000    0.033960     cell_363  Blood progenitors 2    3.460521   15.574629
6      Sulf2  0.000000    0.050277     cell_385  Blood progenitors 2    2.351203   15.267069
7      Sulf2  0.000000    0.033758     cell_592           Erythroid1    6.170377   12.916482
8      Sulf2  0.000000    0.011413   cell_16475           Erythroid2    8.311832    9.724998
9      Sulf2  0.000000    0.007784  cell_139318           Erythroid3    8.032358    7.603037
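
Before running a prediction on your own file, it can help to confirm that the table follows the layout described above. The following is a minimal sketch, not part of cellDancer: check_input_format is a hypothetical helper, and the rules it checks (one row per gene and cell pair, every gene covering the same cells, identical cell annotations across genes) are assumptions read off the sample table rather than documented requirements.

```python
import io
import pandas as pd

# Columns described in this section.
REQUIRED_COLUMNS = ['gene_name', 'unsplice', 'splice', 'cellID',
                    'clusters', 'embedding1', 'embedding2']

def check_input_format(df):
    """Return a list of problems found in a long-format table (hypothetical helper)."""
    problems = []
    missing = [c for c in REQUIRED_COLUMNS if c not in df.columns]
    if missing:
        problems.append(f'missing columns: {missing}')
        return problems
    # Assumption: each (gene, cell) pair appears exactly once.
    if df.duplicated(subset=['gene_name', 'cellID']).any():
        problems.append('duplicated (gene_name, cellID) rows')
    # Assumption: every gene lists the same set of cells, as in the sample.
    n_cells = df['cellID'].nunique()
    cells_per_gene = df.groupby('gene_name')['cellID'].nunique()
    if (cells_per_gene != n_cells).any():
        problems.append('some genes do not cover every cell')
    # Assumption: cell-level columns are repeated unchanged for each gene.
    cell_cols = ['clusters', 'embedding1', 'embedding2']
    variants = df.groupby('cellID')[cell_cols].nunique()
    if (variants > 1).any().any():
        problems.append('cell annotations differ between genes')
    return problems

# Two cells and two genes taken from the sample table above.
sample_csv = """gene_name,unsplice,splice,cellID,clusters,embedding1,embedding2
Hba-x,0.000000,0.123217,cell_363,Blood progenitors 2,3.460521,15.574629
Hba-x,0.023665,21.719713,cell_592,Erythroid1,6.170377,12.916482
Sulf2,0.000000,0.033960,cell_363,Blood progenitors 2,3.460521,15.574629
Sulf2,0.000000,0.033758,cell_592,Erythroid1,6.170377,12.916482
"""
sample_df = pd.read_csv(io.StringIO(sample_csv))
problems = check_input_format(sample_df)
print(problems)
```

An empty list means none of the assumed rules were violated. For a real file, replace the in-memory sample with pd.read_csv('your_path/cell_type_u_s_sample_df.csv').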

Format transfer

The two count matrices of unspliced and spliced abundances can be obtained from standard sequencing protocols, using the velocyto or loompy/kallisto counting pipelines.

We also provide a function (celldancer.utilities.adata_to_csv()) to convert AnnData (a format for storing annotated data, commonly read from a loom file) to csv format. For example, after preprocessing the data in AnnData format, celldancer.utilities.adata_to_raw_with_embed() can be used to write the csv file, as in the command:

celldancer.utilities.adata_to_raw_with_embed(adata,us_para=['Mu','Ms'], cell_type_para='celltype', embed_para='X_umap', save_path='cell_type_u_s.csv', gene_list=['Hba-x','Smim1'])

The splice and unsplice columns are obtained from the 'Ms' and 'Mu' entries of adata.layers, respectively. The cellID column is obtained from adata.obs.index, and the clusters column from adata.obs['celltype']. The embedding1 and embedding2 columns are obtained from the 'X_umap' entry of adata.obsm.
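
To make the column mapping concrete, the sketch below builds the same long-format table from plain NumPy arrays. It is an illustration only and not the cellDancer implementation: layers_to_long_df is a hypothetical helper, and the arrays stand in for the AnnData fields named above (cells in rows, genes in columns). The values are taken from two cells of the sample table.

```python
import numpy as np
import pandas as pd

def layers_to_long_df(unspliced, spliced, gene_names, cell_ids, clusters, embedding):
    """Stack cells x genes matrices into one row per (gene, cell) (hypothetical helper)."""
    unspliced = np.asarray(unspliced, dtype=float)
    spliced = np.asarray(spliced, dtype=float)
    embedding = np.asarray(embedding, dtype=float)
    if unspliced.shape != spliced.shape:
        raise ValueError('unspliced and spliced matrices must have the same shape')
    frames = []
    for j, gene in enumerate(gene_names):
        frames.append(pd.DataFrame({
            'gene_name': gene,                # repeated for every cell
            'unsplice': unspliced[:, j],      # stand-in for the 'Mu' layer
            'splice': spliced[:, j],          # stand-in for the 'Ms' layer
            'cellID': list(cell_ids),         # stand-in for adata.obs.index
            'clusters': list(clusters),       # stand-in for adata.obs['celltype']
            'embedding1': embedding[:, 0],    # stand-in for adata.obsm['X_umap'][:, 0]
            'embedding2': embedding[:, 1],    # stand-in for adata.obsm['X_umap'][:, 1]
        }))
    return pd.concat(frames, ignore_index=True)

# Two cells (rows) and two genes (columns) from the sample table.
long_df = layers_to_long_df(
    unspliced=[[0.0, 0.0], [0.023665, 0.0]],
    spliced=[[0.123217, 0.033960], [21.719713, 0.033758]],
    gene_names=['Hba-x', 'Sulf2'],
    cell_ids=['cell_363', 'cell_592'],
    clusters=['Blood progenitors 2', 'Erythroid1'],
    embedding=[[3.460521, 15.574629], [6.170377, 12.916482]],
)
print(long_df)
```

The result can be written with long_df.to_csv('cell_type_u_s.csv', index=False). Real AnnData layers are often stored as sparse matrices, which this sketch does not handle; they would need to be converted to dense arrays first.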