Data Preparation

In the tutorials Resolve the transcriptional boost in gastrulation erythroid maturation data, Analyze RNA velocity for mouse hippocampal dentate gyrus neurogenesis data, and Decode cell sub-population based on kinetic parameters of pancreatic endocrinogenesis data, the data has already been prepared. To help you load your own data, this section introduces the data format and how to prepare the data.

The input data for velocity prediction is a csv file, which can be loaded with pandas.read_csv(). Take cell_type_u_s_sample_df.csv below as an example: it contains 5 cells and 2 genes, with one row per gene and cell pair. The gene information is given by the columns gene_name, unsplice, and splice. The cell information is given by the columns cellID, clusters, embedding1, and embedding2. The unsplice and splice columns hold the unspliced and spliced abundances, respectively. cellID is the unique id of each cell. clusters is the cell type of each cell. embedding1 and embedding2 are the coordinates of each cell in a 2-dimensional representation such as UMAP, PCA, or t-SNE.

[1]:

import pandas as pd
cell_type_u_s=pd.read_csv('your_path/cell_type_u_s_sample_df.csv')
cell_type_u_s

[1]:

	gene_name	unsplice	splice	cellID	clusters	embedding1	embedding2
0	Hba-x	0.000000	0.123217	cell_363	Blood progenitors 2	3.460521	15.574629
1	Hba-x	0.000000	0.008806	cell_385	Blood progenitors 2	2.351203	15.267069
2	Hba-x	0.023665	21.719713	cell_592	Erythroid1	6.170377	12.916482
3	Hba-x	0.447068	301.915400	cell_16475	Erythroid2	8.311832	9.724998
4	Hba-x	0.665660	637.665650	cell_139318	Erythroid3	8.032358	7.603037
5	Sulf2	0.000000	0.033960	cell_363	Blood progenitors 2	3.460521	15.574629
6	Sulf2	0.000000	0.050277	cell_385	Blood progenitors 2	2.351203	15.267069
7	Sulf2	0.000000	0.033758	cell_592	Erythroid1	6.170377	12.916482
8	Sulf2	0.000000	0.011413	cell_16475	Erythroid2	8.311832	9.724998
9	Sulf2	0.000000	0.007784	cell_139318	Erythroid3	8.032358	7.603037
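
Before running velocity prediction on your own file, it can help to confirm that the table follows the layout described above. The sketch below is illustrative only: check_input_format is a hypothetical helper, not part of celldancer, and the checks it performs (one row per gene and cell pair, consistent cell annotations across genes) are assumptions inferred from the sample table. The sample rows are inlined so that the example runs without a file on disk.

```python
import io

import pandas as pd

# Inline copy of the sample table above, so no file path is needed.
SAMPLE_CSV = """gene_name,unsplice,splice,cellID,clusters,embedding1,embedding2
Hba-x,0.000000,0.123217,cell_363,Blood progenitors 2,3.460521,15.574629
Hba-x,0.000000,0.008806,cell_385,Blood progenitors 2,2.351203,15.267069
Hba-x,0.023665,21.719713,cell_592,Erythroid1,6.170377,12.916482
Hba-x,0.447068,301.915400,cell_16475,Erythroid2,8.311832,9.724998
Hba-x,0.665660,637.665650,cell_139318,Erythroid3,8.032358,7.603037
Sulf2,0.000000,0.033960,cell_363,Blood progenitors 2,3.460521,15.574629
Sulf2,0.000000,0.050277,cell_385,Blood progenitors 2,2.351203,15.267069
Sulf2,0.000000,0.033758,cell_592,Erythroid1,6.170377,12.916482
Sulf2,0.000000,0.011413,cell_16475,Erythroid2,8.311832,9.724998
Sulf2,0.000000,0.007784,cell_139318,Erythroid3,8.032358,7.603037
"""

REQUIRED_COLUMNS = [
    "gene_name", "unsplice", "splice",
    "cellID", "clusters", "embedding1", "embedding2",
]


def check_input_format(df):
    """Hypothetical helper: validate the layout and return (n_genes, n_cells)."""
    missing = [c for c in REQUIRED_COLUMNS if c not in df.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")

    # Assumption: each (gene, cell) pair appears exactly once.
    if df.duplicated(subset=["gene_name", "cellID"]).any():
        raise ValueError("duplicated (gene_name, cellID) rows")

    # Assumption: cell-level columns are identical for every gene of a cell.
    per_cell = df.groupby("cellID")[["clusters", "embedding1", "embedding2"]].nunique()
    if (per_cell > 1).any().any():
        raise ValueError("inconsistent cell annotations across genes")

    return df["gene_name"].nunique(), df["cellID"].nunique()


df = pd.read_csv(io.StringIO(SAMPLE_CSV))
n_genes, n_cells = check_input_format(df)
print(f"{n_genes} genes, {n_cells} cells")
```

For a real file, replace io.StringIO(SAMPLE_CSV) with the path to your csv.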

Format transfer

The two count matrices of unspliced and spliced abundances can be obtained from standard sequencing protocols, using the velocyto or loompy/kallisto counting pipeline.

We also provide a function, celldancer.utilities.adata_to_raw_with_embed(), to convert data from AnnData (a format for storing annotated data matrices, usually loaded from a loom file) to csv format. For example, after preprocessing the data in AnnData format, it can be exported to csv with the following command:

celldancer.utilities.adata_to_raw_with_embed(adata,us_para=['Mu','Ms'], cell_type_para='celltype', embed_para='X_umap', save_path='cell_type_u_s.csv', gene_list=['Hba-x','Smim1'])

The splice and unsplice columns are obtained from the 'Ms' and 'Mu' entries of adata.layers, respectively. The cellID column is obtained from adata.obs.index, and the clusters column from the 'celltype' column of adata.obs. The embedding1 and embedding2 columns are obtained from the 'X_umap' entry of adata.obsm.
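
To make this mapping concrete, here is a minimal sketch of how a long-format table with the same columns could be assembled by hand. It is not the implementation used by celldancer. Plain NumPy arrays and lists stand in for the AnnData fields, all values and names (cell_a, gene_1, to_long_format) are made up for illustration, and it assumes that the layers are cells x genes matrices with gene names taken from adata.var.index.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the AnnData fields (hypothetical values).
cell_ids = ["cell_a", "cell_b", "cell_c"]               # adata.obs.index
cell_types = ["Erythroid1", "Erythroid1", "Erythroid2"]  # adata.obs['celltype']
gene_names = ["gene_1", "gene_2"]                        # adata.var.index (assumed)
Mu = np.array([[0.0, 0.1], [0.2, 0.0], [0.4, 0.3]])      # adata.layers['Mu'], cells x genes
Ms = np.array([[1.0, 0.5], [2.0, 0.6], [4.0, 0.7]])      # adata.layers['Ms'], cells x genes
X_umap = np.array([[1.0, 2.0], [1.5, 2.5], [3.0, 0.5]])  # adata.obsm['X_umap']


def to_long_format(cell_ids, cell_types, gene_names, Mu, Ms, embedding):
    """Hypothetical helper: stack one block of rows per gene, one row per cell."""
    frames = []
    for j, gene in enumerate(gene_names):
        frames.append(pd.DataFrame({
            "gene_name": gene,             # scalar, repeated for every cell
            "unsplice": Mu[:, j],          # unspliced abundance of this gene
            "splice": Ms[:, j],            # spliced abundance of this gene
            "cellID": cell_ids,
            "clusters": cell_types,
            "embedding1": embedding[:, 0],
            "embedding2": embedding[:, 1],
        }))
    return pd.concat(frames, ignore_index=True)


long_df = to_long_format(cell_ids, cell_types, gene_names, Mu, Ms, X_umap)
print(long_df)
```

The cell-level columns (cellID, clusters, embedding1, embedding2) are repeated once per gene, which matches the layout of the sample table shown earlier. The resulting table could then be written out with long_df.to_csv(path, index=False).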