Hierarchical Cell Type Annotation Based on scANVI
Overview
To provide a standardized method for annotating cell types at different resolutions, this protocol describes a hierarchical cell type annotation workflow based on scANVI. The method enables multi-resolution annotation of single-cell RNA-seq datasets by:
- First identifying broad cell classes (primary cell types).
- Then refining annotations to more specific cell types within each class.
- Optionally extending to additional hierarchical levels.
- Integrating datasets in a shared latent space for downstream analysis and visualization.
Materials and Software
- Software Requirements:
- scvi-tools
- Python (≥3.8 recommended)
- Required Python libraries (e.g., NumPy, Pandas, scikit-learn, PyTorch)
- Data:
- Raw gene expression counts from single-cell RNA sequencing studies.
- Reference datasets with annotated cell types, used to train and validate the models.
- Computational Resources:
- A machine capable of handling model training and inference (preferably with a GPU supported by PyTorch).
Steps
Step 1: Data Preparation
- Gene Selection:
- Perform differential expression analysis across annotated cell types in the reference dataset to identify relevant genes for annotation.
- In this study, 1,841 representative genes were selected.
- Data Splitting:
- Split your reference dataset into training and validation sets at a ratio of 5:1.
- Use the raw UMI count matrix of these selected genes as input.
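The 5:1 split can be sketched as follows. This is a minimal illustration using only the Python standard library. The function name `split_reference` and the fixed seed are assumptions, and the protocol does not say whether the split is stratified by cell type, so a plain random split is used here.

```python
import random


def split_reference(cell_ids, train_parts=5, val_parts=1, seed=0):
    """Shuffle cell identifiers and split them train:validation = 5:1."""
    ids = list(cell_ids)
    random.Random(seed).shuffle(ids)  # seeded so the split is reproducible
    n_val = len(ids) * val_parts // (train_parts + val_parts)
    # The first n_val shuffled cells form the validation set, the rest training.
    return ids[n_val:], ids[:n_val]


train_ids, val_ids = split_reference(range(600))
print(len(train_ids), len(val_ids))  # 500 100
```

The resulting identifiers would then be used to subset the raw count matrix, restricted to the selected genes, into training and validation inputs.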
Step 2: Model Training
- Initial Model Training:
- Train an scVI model on the training data for 5 epochs.
- Transfer Learning:
- Initialize a scANVI model with the parameters of the pre-trained scVI model, then fine-tune it for cell class annotation.
- Hyperparameter Exploration:
- Explore various hyperparameters including:
- Latent space dimension: between 10 and 100
- Network layers: between 1 and 10
- Random initialization: up to 10 different seeds
- Select the best model based on validation performance.
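One way to organize Step 2 is sketched below, assuming a recent scvi-tools release. The helper names (`hyperparameter_grid`, `train_scanvi`, `select_best`), the `"Unknown"` placeholder for unlabeled cells, the `val_accuracy` field, and the specific grid values are illustrative assumptions. The protocol gives only the ranges (latent dimension 10 to 100, 1 to 10 layers, up to 10 seeds), not the exact values tried.

```python
from itertools import product


def hyperparameter_grid(latent_dims, layer_counts, seeds):
    """Enumerate every (n_latent, n_layers, seed) combination to try."""
    return [
        {"n_latent": d, "n_layers": n, "seed": s}
        for d, n, s in product(latent_dims, layer_counts, seeds)
    ]


def train_scanvi(adata_train, labels_key, config, unlabeled="Unknown"):
    """Pre-train scVI for 5 epochs, then fine-tune scANVI from its weights."""
    import scvi  # imported lazily so the grid helper works without scvi-tools

    scvi.settings.seed = config["seed"]
    scvi.model.SCVI.setup_anndata(adata_train, labels_key=labels_key)
    vae = scvi.model.SCVI(
        adata_train,
        n_latent=config["n_latent"],
        n_layers=config["n_layers"],
        gene_likelihood="nb",  # negative binomial, as in Step 5
    )
    vae.train(max_epochs=5)
    # Transfer learning: the scANVI model starts from the scVI parameters.
    scanvi = scvi.model.SCANVI.from_scvi_model(
        vae, unlabeled_category=unlabeled, labels_key=labels_key
    )
    scanvi.train()
    return scanvi


def select_best(results):
    """Pick the run with the highest validation accuracy."""
    return max(results, key=lambda r: r["val_accuracy"])


# Example grid (assumed values within the ranges given in the protocol).
grid = hyperparameter_grid([10, 50, 100], [1, 2], range(10))
print(len(grid))  # 60
```

Scoring each trained model on the validation set, for example by comparing `model.predict(...)` with the held-out labels, is left out of the sketch. `select_best` only assumes that each result record carries a `val_accuracy` value.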
Step 3: Hierarchical Annotation
- Cell Class Models:
- Train one model per cell class (31 cell class models in total); each model annotates the specific cell types within its class at the second level.
- Third-Level Annotation (if necessary):
- Repeat the same training process within second-level cell types when finer resolution is required.
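The cascade described in Step 3 can be illustrated with a small routing function. This is a sketch in plain Python rather than the authors' implementation. `predict_class` and the entries of `type_predictors` stand in for trained first-level and second-level models, and the labels in the toy example are placeholders.

```python
def annotate_hierarchically(cells, predict_class, type_predictors):
    """Return (cell_class, cell_type) per cell using a two-level cascade."""
    classes = predict_class(cells)

    # Group cell positions by predicted class so each second-level
    # model is called once on all of its cells.
    by_class = {}
    for i, c in enumerate(classes):
        by_class.setdefault(c, []).append(i)

    types = [None] * len(cells)
    for c, idx in by_class.items():
        predictor = type_predictors.get(c)
        if predictor is None:
            # No second-level model for this class: keep the class label.
            preds = [c] * len(idx)
        else:
            preds = predictor([cells[i] for i in idx])
        for i, p in zip(idx, preds):
            types[i] = p
    return list(zip(classes, types))


# Toy stand-ins for trained models (hypothetical labels and rules).
toy_cells = [0.1, 0.9, 0.2, 0.8]
toy_class_model = lambda xs: ["neuron" if x > 0.5 else "glia" for x in xs]
toy_type_models = {"neuron": lambda xs: ["excitatory"] * len(xs)}

print(annotate_hierarchically(toy_cells, toy_class_model, toy_type_models))
# [('glia', 'glia'), ('neuron', 'excitatory'), ('glia', 'glia'), ('neuron', 'excitatory')]
```

A third level would follow the same pattern, with one more dictionary of predictors keyed by second-level cell type.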
Step 4: Data Integration Using Latent Space
- Integration of Datasets:
- Use the trained scANVI model to infer a latent representation of every cell; this shared latent space is the basis for integration.
- Visualize Data:
- Use UMAP (Uniform Manifold Approximation and Projection) to visualize integrated datasets in the latent space.
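A possible implementation of Step 4 is sketched below. It assumes scanpy is available in addition to the packages listed under Materials and Software (scanpy is not named in this protocol), that `model` is a trained scANVI model, and that `adata` is the AnnData object to embed. The key `"X_scANVI"` and the function name `embed_latent` are arbitrary choices.

```python
def embed_latent(model, adata, rep_key="X_scANVI", plot=False, color=None):
    """Store the scANVI latent space on adata and compute a UMAP from it."""
    import scanpy as sc  # lazy import; scanpy is an assumed extra dependency

    # Latent coordinates inferred by the trained model, one row per cell.
    adata.obsm[rep_key] = model.get_latent_representation(adata)
    # Build the neighbor graph in latent space, not in gene-expression space.
    sc.pp.neighbors(adata, use_rep=rep_key)
    sc.tl.umap(adata)
    if plot:
        sc.pl.umap(adata, color=color)
    return adata
```

Passing `plot=True` together with a column name from `adata.obs` as `color` (for example the predicted cell class) would draw the integrated embedding colored by that annotation.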
Step 5: Model Training Details
- scANVI Model Settings:
- Configure the model to have two layers and a latent space dimension of 50.
- Use a negative binomial likelihood for gene expression modeling.
- Early-Stopping Strategy:
- Implement an early-stopping mechanism that monitors the evidence lower bound (ELBO):
- Stop training if the ELBO does not improve for five consecutive epochs (patience of 5 epochs, improvement threshold of 0).
- Learning Rate Adjustment:
- Apply a learning rate reduction when the loss function plateaus:
- Patience: 8 epochs
- Reduction factor: 0.1
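The settings above could be collected as keyword arguments along the following lines. The argument names follow recent scvi-tools releases (`early_stopping_*` for the trainer, `plan_kwargs` for the training plan). Treating the protocol's "threshold" as `early_stopping_min_delta` is an assumption, and older scvi versions exposed these options under different names, so the mapping should be checked against the installed version.

```python
def scanvi_settings():
    """Return (model_kwargs, train_kwargs) mirroring the Step 5 settings."""
    model_kwargs = {
        "n_layers": 2,            # two layers
        "n_latent": 50,           # latent space dimension of 50
        "gene_likelihood": "nb",  # negative binomial likelihood
    }
    train_kwargs = {
        "early_stopping": True,
        "early_stopping_monitor": "elbo_validation",
        "early_stopping_patience": 5,     # five epochs without improvement
        "early_stopping_min_delta": 0.0,  # threshold of 0 (assumed mapping)
        "plan_kwargs": {
            "reduce_lr_on_plateau": True,
            "lr_patience": 8,   # plateau patience: 8 epochs
            "lr_factor": 0.1,   # reduction factor
        },
    }
    return model_kwargs, train_kwargs


model_kwargs, train_kwargs = scanvi_settings()
# Intended use, assuming a prepared AnnData object `adata`:
#   model = scvi.model.SCANVI(adata, **model_kwargs)
#   model.train(**train_kwargs)
```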
Step 6: Semi-Supervised Training
- Training on Whole Dataset:
- Conduct semi-supervised training with early stopping based on classification accuracy:
- Set patience and threshold at 5 epochs and 0.001, respectively.
- Reduce the learning rate on plateau as described in Step 5.
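To make the stopping rule concrete, the sketch below implements patience-and-threshold early stopping on a metric that should increase, such as classification accuracy. It is a library-independent illustration. The class name `EarlyStopper` and the accuracy values are invented for the example, and counting an epoch as an improvement only when the metric exceeds the best value by more than the threshold is one common interpretation.

```python
class EarlyStopper:
    """Stop when a metric has not improved by more than `threshold`
    for `patience` consecutive epochs (higher metric is better)."""

    def __init__(self, patience=5, threshold=0.001):
        self.patience = patience
        self.threshold = threshold
        self.best = float("-inf")
        self.stale_epochs = 0

    def update(self, value):
        """Record one epoch's metric; return True when training should stop."""
        if value > self.best + self.threshold:
            self.best = value
            self.stale_epochs = 0
        else:
            self.stale_epochs += 1
        return self.stale_epochs >= self.patience


# Illustrative accuracy curve: real gains for 3 epochs, then a plateau.
accuracies = [0.70, 0.80, 0.85, 0.8505, 0.8504, 0.8503, 0.8506, 0.8502, 0.90]
stopper = EarlyStopper(patience=5, threshold=0.001)
stopped_at = None
for epoch, acc in enumerate(accuracies, start=1):
    if stopper.update(acc):
        stopped_at = epoch
        break
print(stopped_at)  # 8
```

In this example the changes after epoch 3 stay below the 0.001 threshold, so the five-epoch patience runs out at epoch 8 and the later value of 0.90 is never reached.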
Conclusion
This protocol outlines a structured approach to annotate cell types hierarchically using scANVI, facilitating both class-level and specific cell-type annotation in single-cell RNA sequencing data.