The summary of 'Single-cell trajectory and pseudotime analysis with Monocle3 and Seurat in R'

This summary of the video was created by an AI. It might contain some inaccuracies.

The video is a tutorial on using Monocle 3 for single-cell trajectory and pseudotime analysis, using a dataset of mouse embryonic fibroblasts being reprogrammed into iPSCs. It covers importing and managing large datasets with Seurat, pre-processing steps such as normalization and scaling, and dataset integration. The core Monocle 3 steps are creating a cell data set (CDS), clustering cells, ordering them in pseudotime, and visualizing the result on a UMAP. Genes such as Sox2, Nanog, and collagen are examined for expression changes along the trajectory. The video also addresses practical problems, including data integration issues, memory crashes, and speeding up computation by using multiple CPU cores. It highlights Monocle 3's graphical interfaces for selecting and subsetting cells, and closes with a recap of Monocle 3's main functions: clustering, trajectory construction, differential gene expression, and plotting.

00:00:00

In this segment, the video explains how to use Monocle 3 to build single-cell trajectories and perform pseudotime analysis on a reprogramming dataset of approximately 300,000 cells. The dataset tracks mouse embryonic fibroblasts de-differentiating into iPSCs over 18 days. The speaker downloads the data from GEO and discusses installing the necessary dependencies, including common installation errors. They use Seurat to load and pre-process the H5 files because of Seurat’s popularity and ease of use. To keep the large dataset manageable, they focus on the “2i” treatment subset.

00:03:00

In this part of the video, the speaker writes a simple function to read the H5 files, which carry no metadata. The function extracts the day information from the files, creates a Seurat object, and filters out cells with fewer than 200 genes as well as outliers based on mitochondrial reads and percentiles. The day is added to the metadata, and the function returns the processed object. The speaker uses `sapply` to build a list of objects from 80 files, merges them into a single Seurat object, and runs basic pre-processing: normalization, finding variable features, scaling, computing PCs, and generating a UMAP, so that later steps do not fail. Integration was avoided because the days were interpreted as batch effects, which collapsed the data across time points. The speaker notes, however, that for data from the same system that do need batch correction, the integration steps, which involve scaling and PCA, should be followed.
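
The reader function and merge described above could look roughly like the following sketch. It is not the speaker's code: the helper names (`extract_day`, `read_day_h5`, `load_and_merge`, `preprocess`), the file-name pattern with a `D<number>` token, the 95th-percentile cutoff, and the choice of 30 PCs are all assumptions made for illustration. It uses `lapply` rather than `sapply` only to guarantee a list, and it assumes each H5 file holds gene-expression counts only.

```r
# Hypothetical helper: pull a day token such as "D10.5" out of a file name.
# The naming pattern is an assumption, not something stated in the video.
extract_day <- function(path) {
  name <- basename(path)
  m <- regmatches(name, regexpr("D[0-9]+(\\.[0-9]+)?", name))
  if (length(m) == 0) NA_character_ else m
}

# Read one H5 file, filter cells, and attach the day as metadata.
read_day_h5 <- function(path, min_features = 200, upper_q = 0.95) {
  counts <- Seurat::Read10X_h5(path)
  # min.features drops cells with fewer than `min_features` detected genes
  obj <- Seurat::CreateSeuratObject(counts = counts, min.features = min_features)
  obj[["percent.mt"]] <- Seurat::PercentageFeatureSet(obj, pattern = "^mt-")
  md <- obj[[]]
  # Assumed outlier rule: drop cells above the chosen percentile
  keep <- md$percent.mt <= quantile(md$percent.mt, upper_q) &
    md$nFeature_RNA <= quantile(md$nFeature_RNA, upper_q)
  obj <- subset(obj, cells = rownames(md)[keep])
  obj$day <- extract_day(path)
  obj
}

# Build one object per file and merge them into a single Seurat object.
load_and_merge <- function(paths) {
  objs <- lapply(paths, read_day_h5)
  merge(objs[[1]], y = objs[-1], add.cell.ids = paste0("f", seq_along(objs)))
}

# Basic pre-processing without integration.
preprocess <- function(obj, dims = 1:30) {
  obj <- Seurat::NormalizeData(obj)
  obj <- Seurat::FindVariableFeatures(obj)
  obj <- Seurat::ScaleData(obj)
  obj <- Seurat::RunPCA(obj)
  Seurat::RunUMAP(obj, dims = dims)
}
```

A call such as `preprocess(load_and_merge(list.files("data", pattern = "\\.h5$", full.names = TRUE)))` would chain the steps, where `"data"` is a placeholder directory.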

00:06:00

In this part of the video, the presenter discusses processing cell data using RPCA and integrating it into one data object, then visualizing the results with a UMAP grouped by day. The presenter converts the day metadata into numeric values to show a smooth time progression of cell states from fibroblasts to iPSCs. The integrated data is then converted into a Monocle 3 cell data set (CDS). Cell clustering is run with default settings, and the results show distinct partitions, which are visualized by coloring the cells by partition. The presenter notes that Monocle 3 by default learns a separate trajectory for each partition, but they prefer to connect all cells in a single trajectory. This requires setting `use_partition = FALSE` when learning the graph with `learn_graph`; the step is run even though it can take several hours with this many cells.
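
As a rough sketch of the conversion and graph-learning steps, assuming day labels of the form `D10.5` and the `as.cell_data_set` converter from the SeuratWrappers package (the video does not name the converter in this summary), one could write something like the following. The helper names are hypothetical.

```r
# Assumed label format: "D0", "D10.5", ... -> 0, 10.5, ...
day_to_numeric <- function(day) as.numeric(sub("^D", "", day))

# Convert a processed Seurat object to a Monocle 3 CDS, cluster with default
# settings, and learn one connected trajectory across all partitions.
build_trajectory <- function(seurat_obj) {
  cds <- SeuratWrappers::as.cell_data_set(seurat_obj)
  cds <- monocle3::cluster_cells(cds)
  # use_partition = FALSE joins the partitions into a single graph;
  # on a very large dataset this call can run for hours
  monocle3::learn_graph(cds, use_partition = FALSE)
}
```

After clustering, `monocle3::plot_cells(cds, color_cells_by = "partition")` shows how the cells separate into partitions.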

00:09:00

In this segment, the video discusses ordering cells and plotting pseudotime. Monocle 3's `order_cells` can open a pop-up window in which the user picks the root nodes interactively, and the resulting pseudotime closely tracks the known experimental time, following a trajectory from fibroblasts to induced pluripotent stem cells (iPSCs). The speaker highlights the value of inspecting this trajectory, particularly for cells that fail to reprogram into iPSCs. The segment then shows how to examine gene expression along the path: Sox2 and Nanog increase, while collagen fluctuates. An error related to gene names can occur because of how the object is converted between Seurat and Monocle 3, and a workaround is suggested. The video also touches on Monocle 3's differential expression functions, although fitting a model on this large dataset caused a memory crash.
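
The ordering and plotting steps might be sketched as below. The summary does not spell out the gene-name workaround, so the one shown here (copying row names into a `gene_short_name` column, a commonly used fix for converted objects) is an assumption, as are the function name `order_and_plot` and the choice of `Col1a1` to stand in for "collagen".

```r
order_and_plot <- function(cds, genes = c("Sox2", "Nanog", "Col1a1")) {
  # Assumed workaround: Monocle 3 plotting looks genes up in
  # rowData(cds)$gene_short_name, which a converted object may lack
  SummarizedExperiment::rowData(cds)$gene_short_name <- rownames(cds)

  # With no root specified, order_cells opens an interactive window
  # for picking the root node(s) of the trajectory
  cds <- monocle3::order_cells(cds)

  print(monocle3::plot_cells(cds, color_cells_by = "pseudotime"))
  print(monocle3::plot_cells(cds, genes = genes))
  cds
}
```

Because `order_cells` is interactive here, the function is meant for an interactive R session rather than a batch script.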

00:12:00

In this segment of the video, the speaker focuses on identifying genes that change with pseudotime using Monocle 3's `graph_test`. They explain passing in the CDS (cell data set) object, specifying the principal graph as the neighbor graph, and adjusting the number of CPU cores according to available memory to speed up the task. Because of the large number of cells, the test takes hours, so they ran it in advance, saved the results as an RDS file, and load that file into their workspace.
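
A minimal sketch of this run-once-and-cache pattern is shown below. The helper `cached`, the wrapper `pseudotime_genes`, the default of 4 cores, and the file name `graph_test_res.rds` are all hypothetical choices for illustration.

```r
# Return the saved result if the file exists; otherwise compute and save it.
cached <- function(path, compute) {
  if (file.exists(path)) return(readRDS(path))
  result <- compute()
  saveRDS(result, path)
  result
}

# Test each gene for association with position on the principal graph.
# More cores run faster but need more memory.
pseudotime_genes <- function(cds, cores = 4, cache = "graph_test_res.rds") {
  cached(cache, function() {
    monocle3::graph_test(cds, neighbor_graph = "principal_graph", cores = cores)
  })
}
```

On the first call the test runs and the result is written to disk; later calls with the same cache path simply read the file.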

The output is a data frame with one row per gene and statistics such as Moran’s test statistic (a measure of spatial correlation) and the q-value (the multiple-testing-corrected p-value). The speaker filters this data frame to keep genes with q-values below 0.05, which leaves 16,000 spatially correlated genes. They then order the genes by Moran’s test statistic and highlight some interesting genes such as serpent, collagen (a fibroblast marker), and others.
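
The filtering and ranking can be sketched in base R as follows. The column names `q_value` and `morans_test_statistic` match the `graph_test` output; the function name `significant_genes` is hypothetical, and `morans_I` could be passed as the ranking column instead.

```r
# Keep genes below the q-value cutoff and rank them, strongest first.
significant_genes <- function(res, q_cutoff = 0.05, by = "morans_test_statistic") {
  sig <- res[!is.na(res$q_value) & res$q_value < q_cutoff, , drop = FALSE]
  sig[order(sig[[by]], decreasing = TRUE), , drop = FALSE]
}

# Toy data frame standing in for real graph_test output
toy <- data.frame(
  morans_test_statistic = c(5, 50, 20, 9),
  q_value = c(0.2, 0.001, 0.01, NA),
  row.names = c("g1", "g2", "g3", "g4")
)
top <- significant_genes(toy)
```

In the toy example only `g2` and `g3` pass the cutoff, and `g2` ranks first because its test statistic is larger.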

To visualize these genes, they plot the top genes, showing their spatial correlation along the trajectory. However, these plots do not reveal decision points or trajectory branches. To address this, the speaker uses Monocle 3’s `choose_cells` function, which provides a graphical interface for selecting a subset of cells. They subset to a crucial branching point, which yields about 3,500 cells, and rerun the previous analysis on that subset. The speaker compares this with the Monocle 3 example, which has distinct cell types and branches, and explains that their own data may not show branches as clearly, but they proceed with the same analysis approach.
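
Subsetting around a branch point and repeating the test could be sketched like this. The wrapper name `branch_test` and the default of 4 cores are assumptions; the two Monocle 3 calls follow the pattern in the Monocle 3 tutorial.

```r
branch_test <- function(cds, cores = 4) {
  # Opens an interactive window for selecting cells on the embedding
  cds_sub <- monocle3::choose_cells(cds)

  # Same test as before, now restricted to the selected cells
  res <- monocle3::graph_test(cds_sub,
                              neighbor_graph = "principal_graph",
                              cores = cores)
  list(cds = cds_sub, result = res)
}
```

With only a few thousand cells in the subset, the test runs far faster than on the full dataset.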

00:15:00

In this part of the video, the speaker filters out insignificant values from the results; on the smaller subset the analysis runs in about six minutes. They show that the plots may not look optimal because of the rectangular subset that was chosen, but stress that this is a simple tutorial. The speaker then introduces another plotting function from Monocle 3, which requires taking a subset of the subset restricted to specific genes. Although the resulting plot is messy, the speaker presents it as an example modeled on a distinct branch in the Monocle 3 tutorial, showing gene expression as a function of pseudotime. The segment concludes with a summary of Monocle 3's basic functions: clustering, partitioning, trajectory creation, identification of differentially expressed genes, data subsetting, and plotting expression over pseudotime.
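
The "subset of a subset" plot could be sketched as below, assuming the plotting function meant is `plot_genes_in_pseudotime` and that the CDS has a `gene_short_name` column and computed pseudotime. The helper names and the `min_expr = 0.5` threshold are illustrative choices.

```r
# Hypothetical helper: keep only the requested genes that are present,
# warning about any that are not found.
present_genes <- function(available, wanted) {
  missing <- setdiff(wanted, available)
  if (length(missing) > 0) {
    warning("genes not found: ", paste(missing, collapse = ", "))
  }
  intersect(wanted, available)
}

# Restrict an already subsetted CDS to a few genes and plot their
# expression as a function of pseudotime.
plot_branch_genes <- function(cds_sub, genes) {
  gene_names <- SummarizedExperiment::rowData(cds_sub)$gene_short_name
  keep <- present_genes(gene_names, genes)
  gene_cds <- cds_sub[gene_names %in% keep, ]
  monocle3::plot_genes_in_pseudotime(gene_cds, min_expr = 0.5)
}
```

If the plot call complains about missing size factors on a converted object, running `monocle3::estimate_size_factors` on the CDS beforehand may help.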