25 单细胞数据处理

在生物信息学和数据分析领域，对公开发表的单细胞表达谱数据和元数据进行整合与处理是至关重要的一步。为了进行后续的分析和挖掘，这些原始数据需要被转换成适合处理的Seurat数据对象(Hao 等 2024)。以下是对这一过程的书面化描述：

25.1 加载R包

使用rm(list = ls())来清空环境中的所有变量。

library(tidyverse)
library(data.table)
library(Seurat)

rm(list = ls())
options(stringsAsFactors = F)
options(future.globals.maxSize = 10000 * 1024^2)

grp_names <- c("Early Stage", "Late Stage")
grp_colors <- c("#8AC786", "#B897CA")
grp_shapes <- c(15, 16)

25.2 导入数据

GSE149614所有数据获取方式见小节 6.5

phenotype <- fread("./data/GSE149614_scRNA/GSE149614_HCC.metadata.updated.txt")
profile <- fread("./data/GSE149614_scRNA/GSE149614_HCC.scRNAseq.S71915.count.txt")

25.3 原始数据转换成Seurat对象

Seurat对象(Hao 等 2024)是单细胞分析常见的数据对象，Seurat::CreateSeuratObject函数可以将原始数据转换成Seurat对象。

Seurat是一个集成了多种处理单细胞RNA测序数据功能的R包。它允许研究人员方便地对单细胞数据进行探索、预处理、分析和可视化。通过Seurat对象，研究人员可以执行细胞类型分类、基因表达分析、细胞群聚等操作。Seurat数据对象是一种高度灵活的数据结构，主要用于存储单细胞RNA测序数据。

CreateObject <- function(
    count, 
    metadata, 
    proj = "GSE149614"){
  
  res <- Seurat::CreateSeuratObject(
    counts = count.cln, 
    assay = "RNA",
    min.cells = 3,
    # min.features = 400,
    meta.data = metadata.cln,
    project = proj)
  res[["Batch"]] <-  proj
  
  return(res)
}

Seurat_raw <- CreateObject(
  count = profile,
  metadata = phenotype,
  proj = "GSE149614")

Seurat_raw


Seurat_raw <- readRDS("./data/result/scRNA/Seurat_raw.RDS")

Seurat_raw

结果：Seurat对象需要注意的列信息：

orig.ident: 该列将包含已知的样本身份。如果在加载数据时提供了project参数的值，它将会默认使用这个值（通过Seurat_raw@meta.data$orig.ident访问）；
nCount_RNA/nUMI: 表示每个细胞中UMI（Unique Molecular Identifier，唯一分子标识符）的数量。UMI通常用于区分PCR扩增产生的相同序列的分子，以便更准确地估计基因的表达水平（通过Seurat_raw@meta.data$nCount_RNA访问）；
nFeature_RNA/nGene: 表示每个细胞中检测到的基因数量。这里的“基因”可能指的是检测到的转录本或独特的RNA分子。这一数据通常用于质量控制，例如排除基因表达量过低或过高（可能是噪声或异常值）的细胞（通过Seurat_raw@meta.data$nFeature_RNA访问）。

Seurat对象的count matrix


GetAssayData(Seurat_raw, assay = "RNA", layer = "counts")[c("ODF4", "MC4R", "UBOX5-AS1"), 1:10]

结果：矩阵中的.值代表0（未检测到分子）。由于scRNA-seq矩阵中的大多数值都是0，Seurat在可能的情况下使用稀疏矩阵表示法。这可以为Drop-seq/inDrop/10x数据节省大量内存和速度。


dense.size <- object.size(as.matrix(GetAssayData(Seurat_raw, assay = "RNA", layer = "counts")))
dense.size
sparse.size <- object.size(GetAssayData(Seurat_raw, assay = "RNA", layer = "counts"))
sparse.size
dense.size / sparse.size

25.4 数据过滤

为了更容易地识别不同的细胞类型群体，我们需要过滤数据以仅包含高质量的真实细胞。同时，我们需要识别任何失败的样本，并尝试挽救数据或将其从分析中移除，此外，还要尝试理解样本失败的原因。

数据过滤的挑战在1）区分质量差的细胞和复杂性较低的细胞；2）选择合适的过滤阈值，以便保留高质量细胞而不移除生物学上相关的细胞类型。因此，在进行质量控制之前，请明确您期望样本中存在的细胞类型。例如，您是否期望样本中有低复杂性的细胞或线粒体表达水平较高的细胞？如果是这样，那么在评估数据质量时，我们需要考虑这种生物学特性。

25.4.1 过滤指标

number of genes per UMI for each cell：只需要取每个细胞检测到的基因数量的以10为底的对数，以及每个细胞UMI数量的以10为底的对数，然后将基因数量的对数值除以UMI数量的对数值。


Seurat_raw$log10GenesPerUMI

Mitochondrial Ratio: 映射到线粒体基因的转录本比例


Seurat_raw$mitoRatio <- PercentageFeatureSet(object = Seurat_raw, pattern = "^MT-", assay = "RNA")
Seurat_raw$mitoRatio <- Seurat_raw@meta.data$mitoRatio / 100

使用分布图可视化过滤指标在样本或者细胞等分布


Seurat_raw@meta.data %>% 
  ggplot(aes(x = sample, fill = sample)) + 
  geom_bar() +
  geom_hline(yintercept = c(500, 1800)) +  
  theme_classic() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))

Seurat_raw@meta.data %>% 
  ggplot(aes(x = nCount_RNA, color = sample, fill = sample)) + 
  geom_density(alpha = 0.2) +
  scale_x_log10() +
  geom_vline(xintercept = 500) +  
  theme_classic() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))

Seurat_raw@meta.data %>% 
  ggplot(aes(x = nFeature_RNA, color = sample, fill= sample)) + 
  geom_density(alpha = 0.2) + 
  scale_x_log10() + 
  geom_vline(xintercept = 300) +
  theme_classic() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))

Seurat_raw@meta.data %>% 
  ggplot(aes(x = log10GenesPerUMI, color = sample, fill= sample)) + 
  geom_density(alpha = 0.2) + 
  scale_x_log10() + 
  geom_vline(xintercept = 0.8) +
  theme_classic() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) 

Seurat_raw@meta.data %>% 
  ggplot(aes(x = mitoRatio, color = sample, fill= sample)) + 
  geom_density(alpha = 0.2) + 
  scale_x_log10() + 
  geom_vline(xintercept = 0.2) +
  theme_classic() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))

结果：从每个过滤指标的分布图确定最近过滤阈值。


VlnPlot(Seurat_raw, 
        features = c("nCount_RNA", "nFeature_RNA", 
                     "log10GenesPerUMI", "mitoRatio"), 
        pt.size = 0, ncol = 2)

25.4.2 过滤处理

根据上述过滤指标的阈值，过滤数据。以下是过滤的原因：

过滤不符合要求的细胞或基因


Seurat_filter <- subset(
  x = Seurat_raw,
  subset = (nCount_RNA > 500) & (nCount_RNA < 150000) &
    (nFeature_RNA > 200) & (nFeature_RNA < 7500) &
    (log10GenesPerUMI > 0.8) & 
    (mitoRatio < 0.2))

Seurat_filter

结果：有2497个细胞不符合上述过滤指标阈值要求被丢弃。


VlnPlot(Seurat_filter, 
        features = c("nCount_RNA", "nFeature_RNA", 
                     "log10GenesPerUMI", "mitoRatio"), 
        pt.size = 0, ncol = 2)

25.5 输出结果

if (!dir.exists("./data/result/scRNA/")) {
  dir.create("./data/result/scRNA/", recursive = TRUE)
}

saveRDS(Seurat_raw, file = "./data/result/scRNA/Seurat_raw.RDS", compress = TRUE)
saveRDS(Seurat_filter, file = "./data/result/scRNA/Seurat_filter.RDS", compress = TRUE)

25.6 总结

整个过程涉及数据获取、数据转换、数据过滤和数据保存等多个步骤，是生物信息学和数据分析中不可或缺的一部分。通过这一过程，我们可以从公开发表的单细胞表达谱数据中提取有价值的信息，为后续研究核心特征如SLC6A8等在癌症早晚期的细胞水平表达研究提供有力的支持。接下来需要对该数据进行数据标准化等操作。

系统信息

sessionInfo()

R version 4.3.3 (2024-02-29)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Sonoma 14.2

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Asia/Shanghai
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
 [1] Seurat_5.0.3       SeuratObject_5.0.2 sp_2.1-4           data.table_1.15.4 
 [5] lubridate_1.9.3    forcats_1.0.0      stringr_1.5.1      dplyr_1.1.4       
 [9] purrr_1.0.2        readr_2.1.5        tidyr_1.3.1        tibble_3.2.1      
[13] ggplot2_3.5.1      tidyverse_2.0.0   

loaded via a namespace (and not attached):
  [1] deldir_2.0-4           pbapply_1.7-2          gridExtra_2.3         
  [4] rlang_1.1.3            magrittr_2.0.3         RcppAnnoy_0.0.22      
  [7] spatstat.geom_3.2-9    matrixStats_1.3.0      ggridges_0.5.6        
 [10] compiler_4.3.3         reshape2_1.4.4         png_0.1-8             
 [13] vctrs_0.6.5            pkgconfig_2.0.3        fastmap_1.1.1         
 [16] utf8_1.2.4             promises_1.3.0         rmarkdown_2.26        
 [19] tzdb_0.4.0             xfun_0.43              jsonlite_1.8.8        
 [22] goftest_1.2-3          later_1.3.2            spatstat.utils_3.0-4  
 [25] irlba_2.3.5.1          parallel_4.3.3         cluster_2.1.6         
 [28] R6_2.5.1               ica_1.0-3              spatstat.data_3.0-4   
 [31] stringi_1.8.4          RColorBrewer_1.1-3     reticulate_1.37.0     
 [34] parallelly_1.37.1      scattermore_1.2        lmtest_0.9-40         
 [37] Rcpp_1.0.12            knitr_1.46             tensor_1.5            
 [40] future.apply_1.11.2    zoo_1.8-12             sctransform_0.4.1     
 [43] httpuv_1.6.15          Matrix_1.6-5           splines_4.3.3         
 [46] igraph_2.0.3           timechange_0.3.0       tidyselect_1.2.1      
 [49] abind_1.4-5            rstudioapi_0.16.0      yaml_2.3.8            
 [52] spatstat.random_3.2-3  spatstat.explore_3.2-7 codetools_0.2-19      
 [55] miniUI_0.1.1.1         listenv_0.9.1          plyr_1.8.9            
 [58] lattice_0.22-6         shiny_1.8.1.1          withr_3.0.0           
 [61] ROCR_1.0-11            evaluate_0.23          Rtsne_0.17            
 [64] future_1.33.2          fastDummies_1.7.3      survival_3.7-0        
 [67] polyclip_1.10-6        fitdistrplus_1.1-11    pillar_1.9.0          
 [70] BiocManager_1.30.23    KernSmooth_2.23-22     renv_1.0.0            
 [73] plotly_4.10.4          generics_0.1.3         RcppHNSW_0.6.0        
 [76] hms_1.1.3              munsell_0.5.1          scales_1.3.0          
 [79] globals_0.16.3         xtable_1.8-4           glue_1.7.0            
 [82] lazyeval_0.2.2         tools_4.3.3            RSpectra_0.16-1       
 [85] RANN_2.6.1             leiden_0.4.3.1         dotCall64_1.1-1       
 [88] cowplot_1.1.3          grid_4.3.3             colorspace_2.1-0      
 [91] nlme_3.1-164           patchwork_1.2.0        cli_3.6.2             
 [94] spatstat.sparse_3.0-3  spam_2.10-0            fansi_1.0.6           
 [97] viridisLite_0.4.2      uwot_0.2.2             gtable_0.3.5          
[100] digest_0.6.35          progressr_0.14.0       ggrepel_0.9.5         
[103] htmlwidgets_1.6.4      htmltools_0.5.8.1      lifecycle_1.0.4       
[106] httr_1.4.7             mime_0.12              MASS_7.3-60.0.1