Data Preparation¶
In our tutorials (Resolve the transcriptional boost in gastrulation erythroid maturation data, Analyze RNA velocity for mouse hippocampal dentate gyrus neurogenesis data, and Decode cell sub-population based on kinetic parameters of pancreatic endocrinogenesis data), the data has already been prepared. This section introduces the data format and how to prepare your own data for loading.
The input data for velocity prediction is in csv format and can be loaded using pandas.read_csv(). Take cell_type_u_s_sample_df.csv below as an example, which contains 5 cells and 2 genes. The gene information is represented by the columns gene_name, unsplice, and splice. The cell information is represented by the columns cellID, clusters, embedding1, and embedding2. The unsplice and splice columns hold the unspliced and spliced counts, respectively. cellID is the unique ID of each cell. clusters is the cell type of each cell. embedding1 and embedding2 are the coordinates of each cell in a 2-dimensional representation such as UMAP, PCA, or t-SNE.
[1]:
import pandas as pd
cell_type_u_s=pd.read_csv('your_path/cell_type_u_s_sample_df.csv')
cell_type_u_s
[1]:
| | gene_name | unsplice | splice | cellID | clusters | embedding1 | embedding2 |
---|---|---|---|---|---|---|---|
0 | Hba-x | 0.000000 | 0.123217 | cell_363 | Blood progenitors 2 | 3.460521 | 15.574629 |
1 | Hba-x | 0.000000 | 0.008806 | cell_385 | Blood progenitors 2 | 2.351203 | 15.267069 |
2 | Hba-x | 0.023665 | 21.719713 | cell_592 | Erythroid1 | 6.170377 | 12.916482 |
3 | Hba-x | 0.447068 | 301.915400 | cell_16475 | Erythroid2 | 8.311832 | 9.724998 |
4 | Hba-x | 0.665660 | 637.665650 | cell_139318 | Erythroid3 | 8.032358 | 7.603037 |
5 | Sulf2 | 0.000000 | 0.033960 | cell_363 | Blood progenitors 2 | 3.460521 | 15.574629 |
6 | Sulf2 | 0.000000 | 0.050277 | cell_385 | Blood progenitors 2 | 2.351203 | 15.267069 |
7 | Sulf2 | 0.000000 | 0.033758 | cell_592 | Erythroid1 | 6.170377 | 12.916482 |
8 | Sulf2 | 0.000000 | 0.011413 | cell_16475 | Erythroid2 | 8.311832 | 9.724998 |
9 | Sulf2 | 0.000000 | 0.007784 | cell_139318 | Erythroid3 | 8.032358 | 7.603037 |
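Before running a prediction on your own file, it can help to confirm that it follows the layout described above. The following is a minimal sketch, not part of cellDancer: the helper name check_input_format is hypothetical, and the check that every gene has exactly one row per cell is an assumption drawn from the sample table. The four sample rows are copied from the table above.

```python
import io
import pandas as pd

# Column names taken from the format description above.
REQUIRED_COLUMNS = ["gene_name", "unsplice", "splice", "cellID",
                    "clusters", "embedding1", "embedding2"]

def check_input_format(df):
    """Hypothetical helper (not a cellDancer function): validate the long-format table."""
    missing = [c for c in REQUIRED_COLUMNS if c not in df.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    # Assumption: each (gene, cell) pair appears exactly once.
    if df.duplicated(subset=["gene_name", "cellID"]).any():
        raise ValueError("duplicate (gene_name, cellID) rows")
    n_genes = df["gene_name"].nunique()
    n_cells = df["cellID"].nunique()
    if len(df) != n_genes * n_cells:
        raise ValueError("some genes do not have a row for every cell")
    # Cell-level columns should not vary between genes for the same cell.
    per_cell = df.groupby("cellID")[["clusters", "embedding1", "embedding2"]].nunique()
    if (per_cell > 1).any().any():
        raise ValueError("cell-level columns differ between genes")
    return n_genes, n_cells

# Four rows from the sample table (two genes, two cells).
sample_csv = io.StringIO(
    "gene_name,unsplice,splice,cellID,clusters,embedding1,embedding2\n"
    "Hba-x,0.000000,0.123217,cell_363,Blood progenitors 2,3.460521,15.574629\n"
    "Hba-x,0.023665,21.719713,cell_592,Erythroid1,6.170377,12.916482\n"
    "Sulf2,0.000000,0.033960,cell_363,Blood progenitors 2,3.460521,15.574629\n"
    "Sulf2,0.000000,0.033758,cell_592,Erythroid1,6.170377,12.916482\n"
)
sample_df = pd.read_csv(sample_csv)
n_genes, n_cells = check_input_format(sample_df)
```

The same function can be called on the DataFrame returned by pandas.read_csv() for your own file; it raises ValueError with a short message when the layout does not match.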
Format transfer¶
The two count matrices of unspliced and spliced abundances can be obtained from standard sequencing protocols, using the velocyto or loompy/kallisto counting pipelines.
We also provide a function, celldancer.utilities.adata_to_raw_with_embed(), to convert data from AnnData (a format for storing annotated data, usually loaded from a loom file) to csv format. After preprocessing the data in AnnData format, call this function to write the csv file. For example:
[ ]:
celldancer.utilities.adata_to_raw_with_embed(adata, us_para=['Mu','Ms'], cell_type_para='celltype', embed_para='X_umap', save_path='cell_type_u_s.csv', gene_list=['Hba-x','Smim1'])
In this command, the splice and unsplice columns are obtained from the ['Ms', 'Mu'] attributes of adata.layers. The cellID column is obtained from adata.obs.index. The clusters column is obtained from ['celltype'] of adata.obs. The embedding1 and embedding2 columns are obtained from the ['X_umap'] attribute of adata.obsm.
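To make the column mapping concrete, the sketch below builds the same long-format table by hand. It illustrates the mapping only and does not claim to reproduce how the cellDancer utility is implemented. To stay runnable without an AnnData object, plain NumPy arrays stand in for the AnnData fields. All values and names such as cell_a and gene_1 are toy placeholders, and to_long_format is a hypothetical helper.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the AnnData fields named above (placeholder values, not real data).
obs_index = np.array(["cell_a", "cell_b", "cell_c"])      # adata.obs.index
celltype = np.array(["type_1", "type_1", "type_2"])       # adata.obs['celltype']
var_names = pd.Index(["gene_1", "gene_2"])                # adata.var_names
Mu = np.array([[0.0, 0.1], [0.2, 0.3], [0.4, 0.5]])       # adata.layers['Mu'], cells x genes
Ms = np.array([[1.0, 1.1], [1.2, 1.3], [1.4, 1.5]])       # adata.layers['Ms'], cells x genes
X_umap = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])   # adata.obsm['X_umap']

def to_long_format(gene_list):
    """Hypothetical helper: stack one block of rows per gene, one row per cell."""
    frames = []
    for gene in gene_list:
        j = var_names.get_loc(gene)  # column position of this gene in the layers
        frames.append(pd.DataFrame({
            "gene_name": gene,            # scalar, repeated for every cell
            "unsplice": Mu[:, j],
            "splice": Ms[:, j],
            "cellID": obs_index,
            "clusters": celltype,
            "embedding1": X_umap[:, 0],
            "embedding2": X_umap[:, 1],
        }))
    return pd.concat(frames, ignore_index=True)

long_df = to_long_format(["gene_1", "gene_2"])
# long_df.to_csv('cell_type_u_s.csv', index=False) would write the table to disk.
```

With a real AnnData object the layers may be stored as sparse matrices, in which case they would need to be converted to dense arrays before column slicing as done here.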