Case study 1: Gastrulation erythroid maturation

This tutorial shows how cellDancer analyzes RNA velocity. The process includes (1) estimating RNA velocity, (2) deriving cell fates on embedding space, and (3) estimating pseudotime.

This tutorial also demonstrates cellDancer’s ability to resolve the difficulty of predicting the velocity of genes with transcriptional boost mentioned by Barile et al.

Below is the case study for gastrulation erythroid maturation. We select 12,329 cells with 2,000 genes from haemato-endothelial progenitors, blood progenitors 1/2, and erythroid 1/2/3. The embryonic days are selected from E7.0, E7.25, E7.5, E7.75, E8.0, E8.25, and E8.5.

Import packages

To run the notebook locally, Installation could be referred to install the environment and dependencies.

[1]:
import os
import sys
import glob
import pandas as pd
import math
import matplotlib.pyplot as plt
import celldancer as cd
import celldancer.cdplt as cdplt
from celldancer.cdplt import colormap

Load data

The input data for cellDancer contains the abundances of unspliced RNA and spliced RNA. For the detail of obtaining the dataset and pre-processing steps, Data Preparation could be referred to. We follow the standard data preparation procedures in scVelo with default parameters except that we use 100 nearest neighbors for first-moment calculation.

The data of gastrulation erythroid maturation could be downloaded and unzipped from GastrulationErythroid_cell_type_u_s.csv.zip. It could be loaded by pd.read_csv('your_path/GastrulationErythroid_cell_type_u_s.csv'). To load your own data, the dataframe should contain columns ‘gene_name’, ‘unsplice’, ‘splice’ ,‘cellID’ ,‘clusters’ ,‘embedding1’ , and ‘embedding2.’ For a detailed description of the data structure, Preprocessing could be referred to.

[2]:
cell_type_u_s_path='your_path/GastrulationErythroid_cell_type_u_s.csv'
cell_type_u_s=pd.read_csv(cell_type_u_s_path)
cell_type_u_s
[2]:
gene_name unsplice splice cellID clusters embedding1 embedding2
0 Sox17 0.000000 0.043971 cell_363 Blood progenitors 2 3.460521 15.574629
1 Sox17 0.000000 0.000000 cell_382 Blood progenitors 2 2.490433 14.971734
2 Sox17 0.000000 0.018161 cell_385 Blood progenitors 2 2.351203 15.267069
3 Sox17 0.000000 0.000000 cell_393 Blood progenitors 2 5.899098 14.388825
4 Sox17 0.000000 0.000000 cell_398 Blood progenitors 2 4.823139 15.374831
... ... ... ... ... ... ... ...
24657995 Gm47283 0.214961 1.145533 cell_139318 Erythroid3 8.032358 7.603037
24657996 Gm47283 0.300111 1.072944 cell_139321 Erythroid3 10.352904 6.446736
24657997 Gm47283 0.292607 1.199875 cell_139326 Erythroid3 9.464873 7.261099
24657998 Gm47283 0.266031 1.114659 cell_139327 Erythroid3 9.990495 7.243880
24657999 Gm47283 0.164675 1.111358 cell_139330 Erythroid3 8.260699 7.935455

24658000 rows × 7 columns

Estimate RNA velocity for sample genes

Here, 15 genes in gene_list are estimated as an example. The predicted unspliced and spliced reads, alpha, beta, and gamma are added to the dataframe.

[3]:
gene_list=['Smarca2', 'Rbms2', 'Myo1b', 'Hba-x', 'Yipf5', 'Skap1', 'Smim1', 'Nfkb1', 'Sulf2', 'Blvrb', 'Hbb-y', 'Coro2b', 'Yipf5', 'Phc2', 'Mllt3']

loss_df, cellDancer_df=cd.velocity(cell_type_u_s,\
                                   gene_list=gene_list,\
                                   permutation_ratio=0.125,\
                                   n_jobs=8)
cellDancer_df
Using /Users/shengyuli/Library/CloudStorage/OneDrive-HoustonMethodist/work/Velocity/bin/cellDancer_polish/analysis/CaseStudyNotebook/cellDancer_velocity_2022-10-20 10-11-23 as the output path.
Arranging genes for parallel job.
14  genes were arranged to  2  portions.

Velocity Estimation:   0%|          | 0/2 [00:00<?, ?it/s]
Velocity Estimation:  50%|█████     | 1/2 [00:20<00:20, 20.31s/it]
Velocity Estimation: 100%|██████████| 2/2 [00:29<00:00, 13.99s/it]

[3]:
cellIndex gene_name unsplice splice unsplice_predict splice_predict alpha beta gamma loss cellID clusters embedding1 embedding2
0 0 Mllt3 0.087552 0.061677 0.131864 0.347319 0.264903 0.304851 0.335579 0.111875 cell_363 Blood progenitors 2 3.460521 15.574629
1 1 Mllt3 0.042049 0.056451 0.096471 0.192440 0.253070 0.303873 0.335266 0.111875 cell_382 Blood progenitors 2 2.490433 14.971734
2 2 Mllt3 0.046162 0.073611 0.099587 0.222645 0.253818 0.304017 0.335363 0.111875 cell_385 Blood progenitors 2 2.351203 15.267069
3 3 Mllt3 0.119952 0.142994 0.156111 0.534318 0.270146 0.306117 0.336443 0.111875 cell_393 Blood progenitors 2 5.899098 14.388825
4 4 Mllt3 0.027221 0.136220 0.083502 0.221573 0.243896 0.304314 0.336083 0.111875 cell_398 Blood progenitors 2 4.823139 15.374831
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
172601 12324 Yipf5 0.098889 0.443361 0.101194 0.466415 0.080456 0.198707 0.447713 0.060657 cell_139318 Erythroid3 8.032358 7.603037
172602 12325 Yipf5 0.092605 0.521375 0.094589 0.530523 0.074335 0.197698 0.450130 0.060657 cell_139321 Erythroid3 10.352904 6.446736
172603 12326 Yipf5 0.108367 0.443695 0.109323 0.473431 0.081982 0.198220 0.447828 0.060657 cell_139326 Erythroid3 9.464873 7.261099
172604 12327 Yipf5 0.067904 0.431785 0.074884 0.433994 0.076296 0.200459 0.447011 0.060657 cell_139327 Erythroid3 9.990495 7.243880
172605 12328 Yipf5 0.077812 0.388213 0.084044 0.402742 0.080739 0.200674 0.445712 0.060657 cell_139330 Erythroid3 8.260699 7.935455

172606 rows × 14 columns

Visualize the phase portraits

We visualize the phase portrait of each gene with cdplt.scatter_gene().

From the figures below, the prediction of those genes that have transcriptional boost (Hba-x, Smim1, and Hbb-y) are consistent with the progression of gastrulation erythroid maturation of (1) haemato-endothelial progenitors, (2) blood progenitors 1/2, and (3) erythroid. This reflects the ability of cellDancer to resolve the RNA velocity for the genes in the transcriptional boost scenario.

[4]:
ncols=5
height=math.ceil(len(gene_list)/5)*4
fig = plt.figure(figsize=(20,height))

for i in range(len(gene_list)):
    ax = fig.add_subplot(math.ceil(len(gene_list)/ncols), ncols, i+1)
    cdplt.scatter_gene(
        ax=ax,
        x='splice',
        y='unsplice',
        cellDancer_df=cellDancer_df,
        custom_xlim=None,
        custom_ylim=None,
        colors=colormap.colormap_erythroid,
        alpha=0.5,
        s = 5,
        velocity=True,
        gene=gene_list[i])

    ax.set_title(gene_list[i])
    ax.axis('off')

plt.show()
https://raw.githack.com/GuangyuWangLab2021/cellDancer_website/main/docs/_images/notebooks_case_study_gastrulation_13_0.png

Project the RNA velocity onto the embedding space

To project the prediction of RNA velocity onto the embedding space and to estimate pseudotime using all 2,000 genes, predicted result can be downloaded from GastrulationErythroid_cellDancer_estimation.csv.zip

[5]:
# load the prediction result of all genes
cellDancer_df_path = 'your_path/GastrulationErythroid_cellDancer_estimation.csv'
cellDancer_df = pd.read_csv(cellDancer_df_path)

We calculate the projection of RNA velocity on the embedding with cd.compute_cell_velocity(). The projected direction on embedding space, i.e. columns ‘velocity1’ and ‘velocity2’ are added to the original dataframe. We use cdplt.scatter_cell() to display the predicted direction on embedding space.

[6]:
# compute cell velocity
cellDancer_df=cd.compute_cell_velocity(cellDancer_df=cellDancer_df, projection_neighbor_choice='gene', expression_scale='power10', projection_neighbor_size=10, speed_up=(100,100))

# plot cell velocity
fig, ax = plt.subplots(figsize=(10,10))
cdplt.scatter_cell(ax,
                   cellDancer_df,
                   colors=colormap.colormap_erythroid,
                   alpha=0.5,
                   s=10,
                   velocity=True,
                   legend='on',
                   min_mass=15,
                   arrow_grid=(20,20),
                   custom_xlim=[-6,13],
                   custom_ylim=[2,16], )
ax.axis('off')
plt.show()
https://raw.githack.com/GuangyuWangLab2021/cellDancer_website/main/docs/_images/notebooks_case_study_gastrulation_18_0.png
[7]:
cellDancer_df
[7]:
cellIndex gene_name unsplice splice unsplice_predict splice_predict alpha beta gamma loss cellID clusters embedding1 embedding2 index velocity1 velocity2
0 0 Ift81 0.010658 0.026321 0.009261 0.029549 0.023595 0.042598 0.073999 0.039626 cell_363 Blood progenitors 2 3.460521 15.574629 0 NaN NaN
1 1 Ift81 0.000000 0.044266 0.000946 0.037888 0.020649 0.042942 0.074502 0.039626 cell_382 Blood progenitors 2 2.490433 14.971734 1 NaN NaN
2 2 Ift81 0.000000 0.064559 0.000885 0.055191 0.019326 0.042876 0.075031 0.039626 cell_385 Blood progenitors 2 2.351203 15.267069 2 NaN NaN
3 3 Ift81 0.000000 0.020756 0.001014 0.017791 0.022149 0.043030 0.073879 0.039626 cell_393 Blood progenitors 2 5.899098 14.388825 3 NaN NaN
4 4 Ift81 0.000000 0.013184 0.001037 0.011305 0.022633 0.043055 0.073676 0.039626 cell_398 Blood progenitors 2 4.823139 15.374831 4 NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
24657995 12324 Mcrip1 0.000000 1.128435 0.000157 1.125867 0.005920 0.038023 0.013131 0.051755 cell_139318 Erythroid3 8.032358 7.603037 12324 NaN NaN
24657996 12325 Mcrip1 0.024356 0.970672 0.016090 1.428982 0.008338 0.037053 0.013950 0.051755 cell_139321 Erythroid3 10.352904 6.446736 12325 NaN NaN
24657997 12326 Mcrip1 0.000000 0.899107 0.000175 0.897000 0.006575 0.037644 0.013522 0.051755 cell_139326 Erythroid3 9.464873 7.261099 12326 NaN NaN
24657998 12327 Mcrip1 0.017375 1.398107 0.011387 1.729827 0.006885 0.037765 0.013271 0.051755 cell_139327 Erythroid3 9.990495 7.243880 12327 NaN NaN
24657999 12328 Mcrip1 0.005194 0.988933 0.003540 1.086199 0.006822 0.037569 0.013596 0.051755 cell_139330 Erythroid3 8.260699 7.935455 12328 NaN NaN

24658000 rows × 17 columns

Estimate pseudotime

Based on the projection of RNA velocity on embedding space, we estimate the pseudotime with cd.pseudo_time().

The long trajectories used for pseudotime dertermination are (optionally) shown. In this case, the three long trajectories are colored from light to dark according to their own unadjusted pseudotime. The cells are colored according to the long trajectories that are used for determination of the pseudotime of the cells.

[8]:
import random
# set parameters
dt = 0.05
t_total = {dt:int(10/dt)}
n_repeats = 10

# estimate pseudotime
cellDancer_df = cd.pseudo_time(cellDancer_df=cellDancer_df,
                               grid=(30,30),
                               dt=dt,
                               t_total=t_total[dt],
                               n_repeats=n_repeats,
                               speed_up=(100,100),
                               n_paths = 3,
                               plot_long_trajs=True,
                               psrng_seeds_diffusion=[i for i in range(n_repeats)],
                               n_jobs=8)
Pseudo random number generator seeds are set to:  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
Generating Trajectories: 100%|████████████| 9510/9510 [00:02<00:00, 3175.15it/s]
https://raw.githack.com/GuangyuWangLab2021/cellDancer_website/main/docs/_images/notebooks_case_study_gastrulation_23_2.png
There are 3 clusters.
[0 1 2]
--- 257.67111110687256 seconds ---
[9]:
# plot pseudotime
fig, ax = plt.subplots(figsize=(6,6))
im=cdplt.scatter_cell(ax,cellDancer_df, colors='pseudotime', alpha=0.5, velocity=False, custom_xlim=(-5,11), custom_ylim=(4,18))
ax.axis('off')
plt.show()
https://raw.githack.com/GuangyuWangLab2021/cellDancer_website/main/docs/_images/notebooks_case_study_gastrulation_24_0.png

The connection network below is another method to display pseudotime (cdplt.PTO_Graph). The edge lengths indicate the time difference between nodes (the closer in pseudotime, the shorter the edge length). The sizes of the nodes are proportional to the pseudotime.

[10]:
fig, ax= plt.subplots(figsize=(10,10))

cdplt.PTO_Graph(ax,
                cellDancer_df,
                node_layout='forcedirected',
                PRNG_SEED=10,
                use_edge_bundling=True,
                node_colors=colormap.colormap_erythroid,
                edge_length=3,
                node_sizes='pseudotime',
                colorbar='on',
                legend='on')
[10]:
<AxesSubplot:>
https://raw.githack.com/GuangyuWangLab2021/cellDancer_website/main/docs/_images/notebooks_case_study_gastrulation_26_1.png

Display the abundance of spliced RNA along pseudotime

We visualize the spliced RNA abundance of genes along pseudotime with cdplt.scatter_gene().

[11]:
ncols=5
height=math.ceil(len(gene_list)/ncols)*4
fig = plt.figure(figsize=(20,height))

for i in range(len(gene_list)):
    ax = fig.add_subplot(math.ceil(len(gene_list)/ncols), ncols, i+1)
    cdplt.scatter_gene(
        ax=ax,
        x='pseudotime',
        y='splice',
        cellDancer_df=cellDancer_df,
        custom_xlim=None,
        custom_ylim=None,
        colors=colormap.colormap_erythroid,
        alpha=0.5,
        s = 5,
        velocity=False,
        gene=gene_list[i])

    ax.set_title(gene_list[i])
    ax.axis('off')
https://raw.githack.com/GuangyuWangLab2021/cellDancer_website/main/docs/_images/notebooks_case_study_gastrulation_29_0.png