Case study 1: Gastrulation erythroid maturation¶

This tutorial shows how cellDancer analyzes RNA velocity. The process includes (1) estimating RNA velocity, (2) deriving cell fates on embedding space, and (3) estimating pseudotime.

This tutorial also demonstrates cellDancer’s ability to resolve the difficulty of predicting the velocity of genes with transcriptional boost mentioned by Barile et al.

Below is the case study for gastrulation erythroid maturation. We select 12,329 cells with 2,000 genes from haemato-endothelial progenitors, blood progenitors 1/2, and erythroid 1/2/3. The embryonic days are selected from E7.0, E7.25, E7.5, E7.75, E8.0, E8.25, and E8.5.

Import packages¶

To run the notebook locally, Installation could be referred to install the environment and dependencies.

[1]:

import os
import sys
import glob
import pandas as pd
import math
import matplotlib.pyplot as plt
import celldancer as cd
import celldancer.cdplt as cdplt
from celldancer.cdplt import colormap

Load data¶

The input data for cellDancer contains the abundances of unspliced RNA and spliced RNA. For the detail of obtaining the dataset and pre-processing steps, Data Preparation could be referred to. We follow the standard data preparation procedures in scVelo with default parameters except that we use 100 nearest neighbors for first-moment calculation.

The data of gastrulation erythroid maturation could be downloaded and unzipped from GastrulationErythroid_cell_type_u_s.csv.zip. It could be loaded by pd.read_csv('your_path/GastrulationErythroid_cell_type_u_s.csv'). To load your own data, the dataframe should contain columns ‘gene_name’, ‘unsplice’, ‘splice’ ,‘cellID’ ,‘clusters’ ,‘embedding1’ , and ‘embedding2.’ For a detailed description of the data structure, Preprocessing could be referred to.

[2]:

cell_type_u_s_path='your_path/GastrulationErythroid_cell_type_u_s.csv'
cell_type_u_s=pd.read_csv(cell_type_u_s_path)
cell_type_u_s

[2]:

	gene_name	unsplice	splice	cellID	clusters	embedding1	embedding2
0	Sox17	0.000000	0.043971	cell_363	Blood progenitors 2	3.460521	15.574629
1	Sox17	0.000000	0.000000	cell_382	Blood progenitors 2	2.490433	14.971734
2	Sox17	0.000000	0.018161	cell_385	Blood progenitors 2	2.351203	15.267069
3	Sox17	0.000000	0.000000	cell_393	Blood progenitors 2	5.899098	14.388825
4	Sox17	0.000000	0.000000	cell_398	Blood progenitors 2	4.823139	15.374831
...	...	...	...	...	...	...	...
24657995	Gm47283	0.214961	1.145533	cell_139318	Erythroid3	8.032358	7.603037
24657996	Gm47283	0.300111	1.072944	cell_139321	Erythroid3	10.352904	6.446736
24657997	Gm47283	0.292607	1.199875	cell_139326	Erythroid3	9.464873	7.261099
24657998	Gm47283	0.266031	1.114659	cell_139327	Erythroid3	9.990495	7.243880
24657999	Gm47283	0.164675	1.111358	cell_139330	Erythroid3	8.260699	7.935455

24658000 rows × 7 columns

Estimate RNA velocity for sample genes¶

Here, 15 genes in gene_list are estimated as an example. The predicted unspliced and spliced reads, alpha, beta, and gamma are added to the dataframe.

[3]:

gene_list=['Smarca2', 'Rbms2', 'Myo1b', 'Hba-x', 'Yipf5', 'Skap1', 'Smim1', 'Nfkb1', 'Sulf2', 'Blvrb', 'Hbb-y', 'Coro2b', 'Yipf5', 'Phc2', 'Mllt3']

loss_df, cellDancer_df=cd.velocity(cell_type_u_s,\
                                   gene_list=gene_list,\
                                   permutation_ratio=0.125,\
                                   n_jobs=8)
cellDancer_df

Using /Users/shengyuli/Library/CloudStorage/OneDrive-HoustonMethodist/work/Velocity/bin/cellDancer_polish/analysis/CaseStudyNotebook/cellDancer_velocity_2022-10-20 10-11-23 as the output path.
Arranging genes for parallel job.
14  genes were arranged to  2  portions.


Velocity Estimation:   0%|          | 0/2 [00:00<?, ?it/s]
Velocity Estimation:  50%|█████     | 1/2 [00:20<00:20, 20.31s/it]
Velocity Estimation: 100%|██████████| 2/2 [00:29<00:00, 13.99s/it]

[3]:

	cellIndex	gene_name	unsplice	splice	unsplice_predict	splice_predict	alpha	beta	gamma	loss	cellID	clusters	embedding1	embedding2
0	0	Mllt3	0.087552	0.061677	0.131864	0.347319	0.264903	0.304851	0.335579	0.111875	cell_363	Blood progenitors 2	3.460521	15.574629
1	1	Mllt3	0.042049	0.056451	0.096471	0.192440	0.253070	0.303873	0.335266	0.111875	cell_382	Blood progenitors 2	2.490433	14.971734
2	2	Mllt3	0.046162	0.073611	0.099587	0.222645	0.253818	0.304017	0.335363	0.111875	cell_385	Blood progenitors 2	2.351203	15.267069
3	3	Mllt3	0.119952	0.142994	0.156111	0.534318	0.270146	0.306117	0.336443	0.111875	cell_393	Blood progenitors 2	5.899098	14.388825
4	4	Mllt3	0.027221	0.136220	0.083502	0.221573	0.243896	0.304314	0.336083	0.111875	cell_398	Blood progenitors 2	4.823139	15.374831
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
172601	12324	Yipf5	0.098889	0.443361	0.101194	0.466415	0.080456	0.198707	0.447713	0.060657	cell_139318	Erythroid3	8.032358	7.603037
172602	12325	Yipf5	0.092605	0.521375	0.094589	0.530523	0.074335	0.197698	0.450130	0.060657	cell_139321	Erythroid3	10.352904	6.446736
172603	12326	Yipf5	0.108367	0.443695	0.109323	0.473431	0.081982	0.198220	0.447828	0.060657	cell_139326	Erythroid3	9.464873	7.261099
172604	12327	Yipf5	0.067904	0.431785	0.074884	0.433994	0.076296	0.200459	0.447011	0.060657	cell_139327	Erythroid3	9.990495	7.243880
172605	12328	Yipf5	0.077812	0.388213	0.084044	0.402742	0.080739	0.200674	0.445712	0.060657	cell_139330	Erythroid3	8.260699	7.935455

172606 rows × 14 columns

Visualize the phase portraits¶

We visualize the phase portrait of each gene with cdplt.scatter_gene().

From the figures below, the prediction of those genes that have transcriptional boost (Hba-x, Smim1, and Hbb-y) are consistent with the progression of gastrulation erythroid maturation of (1) haemato-endothelial progenitors, (2) blood progenitors 1/2, and (3) erythroid. This reflects the ability of cellDancer to resolve the RNA velocity for the genes in the transcriptional boost scenario.

[4]:

ncols=5
height=math.ceil(len(gene_list)/5)*4
fig = plt.figure(figsize=(20,height))

for i in range(len(gene_list)):
    ax = fig.add_subplot(math.ceil(len(gene_list)/ncols), ncols, i+1)
    cdplt.scatter_gene(
        ax=ax,
        x='splice',
        y='unsplice',
        cellDancer_df=cellDancer_df,
        custom_xlim=None,
        custom_ylim=None,
        colors=colormap.colormap_erythroid,
        alpha=0.5,
        s = 5,
        velocity=True,
        gene=gene_list[i])

    ax.set_title(gene_list[i])
    ax.axis('off')

plt.show()

https://raw.githack.com/GuangyuWangLab2021/cellDancer_website/main/docs/_images/notebooks_case_study_gastrulation_13_0.png

Project the RNA velocity onto the embedding space¶

To project the prediction of RNA velocity onto the embedding space and to estimate pseudotime using all 2,000 genes, predicted result can be downloaded from GastrulationErythroid_cellDancer_estimation.csv.zip

[5]:

# load the prediction result of all genes
cellDancer_df_path = 'your_path/GastrulationErythroid_cellDancer_estimation.csv'
cellDancer_df = pd.read_csv(cellDancer_df_path)

We calculate the projection of RNA velocity on the embedding with cd.compute_cell_velocity(). The projected direction on embedding space, i.e. columns ‘velocity1’ and ‘velocity2’ are added to the original dataframe. We use cdplt.scatter_cell() to display the predicted direction on embedding space.

[6]:

# compute cell velocity
cellDancer_df=cd.compute_cell_velocity(cellDancer_df=cellDancer_df, projection_neighbor_choice='gene', expression_scale='power10', projection_neighbor_size=10, speed_up=(100,100))

# plot cell velocity
fig, ax = plt.subplots(figsize=(10,10))
cdplt.scatter_cell(ax,
                   cellDancer_df,
                   colors=colormap.colormap_erythroid,
                   alpha=0.5,
                   s=10,
                   velocity=True,
                   legend='on',
                   min_mass=15,
                   arrow_grid=(20,20),
                   custom_xlim=[-6,13],
                   custom_ylim=[2,16], )
ax.axis('off')
plt.show()

https://raw.githack.com/GuangyuWangLab2021/cellDancer_website/main/docs/_images/notebooks_case_study_gastrulation_18_0.png

[7]:

cellDancer_df

[7]:

	cellIndex	gene_name	unsplice	splice	unsplice_predict	splice_predict	alpha	beta	gamma	loss	cellID	clusters	embedding1	embedding2	index	velocity1	velocity2
0	0	Ift81	0.010658	0.026321	0.009261	0.029549	0.023595	0.042598	0.073999	0.039626	cell_363	Blood progenitors 2	3.460521	15.574629	0	NaN	NaN
1	1	Ift81	0.000000	0.044266	0.000946	0.037888	0.020649	0.042942	0.074502	0.039626	cell_382	Blood progenitors 2	2.490433	14.971734	1	NaN	NaN
2	2	Ift81	0.000000	0.064559	0.000885	0.055191	0.019326	0.042876	0.075031	0.039626	cell_385	Blood progenitors 2	2.351203	15.267069	2	NaN	NaN
3	3	Ift81	0.000000	0.020756	0.001014	0.017791	0.022149	0.043030	0.073879	0.039626	cell_393	Blood progenitors 2	5.899098	14.388825	3	NaN	NaN
4	4	Ift81	0.000000	0.013184	0.001037	0.011305	0.022633	0.043055	0.073676	0.039626	cell_398	Blood progenitors 2	4.823139	15.374831	4	NaN	NaN
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
24657995	12324	Mcrip1	0.000000	1.128435	0.000157	1.125867	0.005920	0.038023	0.013131	0.051755	cell_139318	Erythroid3	8.032358	7.603037	12324	NaN	NaN
24657996	12325	Mcrip1	0.024356	0.970672	0.016090	1.428982	0.008338	0.037053	0.013950	0.051755	cell_139321	Erythroid3	10.352904	6.446736	12325	NaN	NaN
24657997	12326	Mcrip1	0.000000	0.899107	0.000175	0.897000	0.006575	0.037644	0.013522	0.051755	cell_139326	Erythroid3	9.464873	7.261099	12326	NaN	NaN
24657998	12327	Mcrip1	0.017375	1.398107	0.011387	1.729827	0.006885	0.037765	0.013271	0.051755	cell_139327	Erythroid3	9.990495	7.243880	12327	NaN	NaN
24657999	12328	Mcrip1	0.005194	0.988933	0.003540	1.086199	0.006822	0.037569	0.013596	0.051755	cell_139330	Erythroid3	8.260699	7.935455	12328	NaN	NaN

24658000 rows × 17 columns

Estimate pseudotime¶

Based on the projection of RNA velocity on embedding space, we estimate the pseudotime with cd.pseudo_time().

The long trajectories used for pseudotime dertermination are (optionally) shown. In this case, the three long trajectories are colored from light to dark according to their own unadjusted pseudotime. The cells are colored according to the long trajectories that are used for determination of the pseudotime of the cells.

[8]:

import random
# set parameters
dt = 0.05
t_total = {dt:int(10/dt)}
n_repeats = 10

# estimate pseudotime
cellDancer_df = cd.pseudo_time(cellDancer_df=cellDancer_df,
                               grid=(30,30),
                               dt=dt,
                               t_total=t_total[dt],
                               n_repeats=n_repeats,
                               speed_up=(100,100),
                               n_paths = 3,
                               plot_long_trajs=True,
                               psrng_seeds_diffusion=[i for i in range(n_repeats)],
                               n_jobs=8)

Pseudo random number generator seeds are set to:  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

Generating Trajectories: 100%|████████████| 9510/9510 [00:02<00:00, 3175.15it/s]

https://raw.githack.com/GuangyuWangLab2021/cellDancer_website/main/docs/_images/notebooks_case_study_gastrulation_23_2.png

There are 3 clusters.
[0 1 2]
--- 257.67111110687256 seconds ---

[9]:

# plot pseudotime
fig, ax = plt.subplots(figsize=(6,6))
im=cdplt.scatter_cell(ax,cellDancer_df, colors='pseudotime', alpha=0.5, velocity=False, custom_xlim=(-5,11), custom_ylim=(4,18))
ax.axis('off')
plt.show()

https://raw.githack.com/GuangyuWangLab2021/cellDancer_website/main/docs/_images/notebooks_case_study_gastrulation_24_0.png

The connection network below is another method to display pseudotime (cdplt.PTO_Graph). The edge lengths indicate the time difference between nodes (the closer in pseudotime, the shorter the edge length). The sizes of the nodes are proportional to the pseudotime.

[10]:

fig, ax= plt.subplots(figsize=(10,10))

cdplt.PTO_Graph(ax,
                cellDancer_df,
                node_layout='forcedirected',
                PRNG_SEED=10,
                use_edge_bundling=True,
                node_colors=colormap.colormap_erythroid,
                edge_length=3,
                node_sizes='pseudotime',
                colorbar='on',
                legend='on')

[10]:

<AxesSubplot:>

https://raw.githack.com/GuangyuWangLab2021/cellDancer_website/main/docs/_images/notebooks_case_study_gastrulation_26_1.png

Display the abundance of spliced RNA along pseudotime¶

We visualize the spliced RNA abundance of genes along pseudotime with cdplt.scatter_gene().

[11]:

ncols=5
height=math.ceil(len(gene_list)/ncols)*4
fig = plt.figure(figsize=(20,height))

for i in range(len(gene_list)):
    ax = fig.add_subplot(math.ceil(len(gene_list)/ncols), ncols, i+1)
    cdplt.scatter_gene(
        ax=ax,
        x='pseudotime',
        y='splice',
        cellDancer_df=cellDancer_df,
        custom_xlim=None,
        custom_ylim=None,
        colors=colormap.colormap_erythroid,
        alpha=0.5,
        s = 5,
        velocity=False,
        gene=gene_list[i])

    ax.set_title(gene_list[i])
    ax.axis('off')

https://raw.githack.com/GuangyuWangLab2021/cellDancer_website/main/docs/_images/notebooks_case_study_gastrulation_29_0.png