6  Data Collection

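After hepatocellular carcinoma (HCC) was selected as the disease of interest, a systematic literature review was carried out, focusing on meta-analyses of HCC, to survey current research and established findings in the field. To support this study, the required datasets were retrieved and downloaded from online resources. Specifically, three major data repositories were used: the International Cancer Genome Consortium (ICGC) Data Portal (see Section 5.2), the National Cancer Institute's Genomic Data Commons (GDC) Data Portal (see Section 5.3), and the Gene Expression Omnibus (GEO; see Section 5.1).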

6.1 Data Distribution

Four HCC transcriptome datasets were collected (LICA-FR, LIRI-JP, LIHC-US/TCGA-LIHC, and GSE14520), together with one HCC single-cell dataset (GSE149614).

6.2 Expression Profile Data

Expression profiles were downloaded for each of the HCC datasets listed above, yielding a different type of expression data for each:

  • LICA-FR: France (Tumor TNM stage)

  • LIRI-JP: Japan (Tumor TNM stage)

  • LIHC-US/TCGA-LIHC: TCGA (Tumor TNM stage)

  • GSE14520: GEO (Tumor TNM stage)

  • GSE149614: GEO (Tumor TNM stage)

6.3 Final Data Distribution

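After the collected phenotype data were carefully filtered and curated, the final datasets and their normalization methods were as follows: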

  • LICA-FR (drop) : France (TPM normalization)

  • LIRI-JP: Japan (TPM normalization)

  • LIHC-US/TCGA-LIHC: TCGA (TPM normalization)

  • GSE14520: GEO (microarray data normalized with the robust multichip average, RMA, method)
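
The TPM normalization listed above can be obtained from FPKM values by rescaling each sample so that its values sum to one million. The sketch below illustrates the arithmetic on a toy matrix. The function name `fpkm_to_tpm` and the toy values are illustrative assumptions, not part of the original pipeline. If the input is stored on a log scale such as log2(FPKM + 1), it would need to be back-transformed first.

```r
# Convert an FPKM matrix (genes x samples) to TPM.
# For each sample j: TPM_ij = FPKM_ij / sum_i(FPKM_ij) * 1e6
fpkm_to_tpm <- function(fpkm) {
  stopifnot(is.matrix(fpkm), is.numeric(fpkm))
  # Divide every column by its own total, then scale to one million
  sweep(fpkm, 2, colSums(fpkm), "/") * 1e6
}

# Toy data (hypothetical values, for illustration only)
fpkm <- matrix(c(10, 30, 60,   # sample1
                  5,  5, 40),  # sample2
               nrow = 3,
               dimnames = list(c("geneA", "geneB", "geneC"),
                               c("sample1", "sample2")))

tpm <- fpkm_to_tpm(fpkm)
colSums(tpm)  # each sample now sums to one million
```

Because TPM values sum to the same total in every sample, they are easier to compare across samples than raw FPKM values.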

6.4 Automated Download of GSE14520

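To streamline data acquisition and avoid downloading similar data repeatedly, an R script was developed to download datasets from the Gene Expression Omnibus (GEO) automatically.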

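# Load R packages
library(GEOquery)   # getGEO()
library(Biobase)    # pData(), exprs(), ExpressionSet(), AnnotatedDataFrame()
library(tidyverse)  # %>% and rownames_to_column()
library(stringr)
library(optparse)
library(convert)
library(idmap1)

# Set parameters
GEO_name <- "GSE14520"
GPL_number <- "GPL571"
GPL_number2 <- "GPL3921"
Array_type <- "array"
dir <- "."

# Clinical data and expression profile
# GSE14520 spans two platforms, so getGEO() returns a list with one
# ExpressionSet per platform; check names(gset) to confirm which one is used.
gset <- getGEO(GEO = GEO_name,
               destdir = dir,
               AnnotGPL = FALSE,
               getGPL = FALSE)
phen <- pData(gset[[1]])
prof <- exprs(gset[[1]])

# ---- Steps not shown in the original listing ----
# The source omits the code that builds phen_post, prof_post, probe2gene and
# ExprSet_object. The lines below are placeholder ASSUMPTIONS so that the
# script runs end to end; replace them with the real filtering and ID mapping.
phen_post <- phen                                    # placeholder: no phenotype filtering
probe2gene <- data.frame(ProbeID = rownames(prof))   # placeholder: no gene symbols attached
prof_post <- data.frame(prof, check.names = FALSE)   # placeholder: still indexed by probe ID
ExprSet_object <- ExpressionSet(
  assayData = as.matrix(prof_post),
  phenoData = AnnotatedDataFrame(phen_post))

# Output directory
outdir <- file.path(dir, paste0(GEO_name, "_process"))
if (!dir.exists(outdir)) {
  dir.create(outdir, recursive = TRUE)
}

# Clinical tables (original and processed)
phen_origin <- file.path(outdir, paste0(GEO_name, "_clinical_origin.csv"))
phen_process <- file.path(outdir, paste0(GEO_name, "_clinical_post.csv"))
write.csv(phen, file = phen_origin, row.names = FALSE)
write.csv(phen_post, file = phen_process, row.names = FALSE)

# Expression profiles (original and processed)
prof_origin <- file.path(outdir, paste0(GEO_name, "_profile_origin.tsv"))
prof_process <- file.path(outdir, paste0(GEO_name, "_profile_post.tsv"))
write.table(data.frame(prof) %>% rownames_to_column("GeneID"),
            file = prof_origin, row.names = FALSE, quote = FALSE, sep = "\t")
write.table(prof_post %>% rownames_to_column("GeneID"),
            file = prof_process, row.names = FALSE, quote = FALSE, sep = "\t")

# Probe-to-gene mapping table
probe2gene_name <- file.path(outdir, paste0(GPL_number, "_probe2gene_table.tsv"))
write.table(probe2gene, file = probe2gene_name,
            row.names = FALSE, quote = FALSE, sep = "\t")

# ExpressionSet object for downstream analysis
ExprSet_name <- file.path(outdir, paste0(GEO_name, "_GeneExprSet.RDS"))
saveRDS(ExprSet_object, file = ExprSet_name)

message("Congrats, Program Ended without problems")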

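The script consists of the following parts:

  • loading the R packages;

  • setting the GSE and GPL accession numbers, where the GPL is the platform associated with the GSE and generally applies to microarray probe data;

  • converting gene IDs and filtering the phenotype data;

  • converting the data into an ExpressionSet object, which holds both the phenotype data and the expression profile and is convenient for downstream analysis.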
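
The gene ID conversion step is described above but not spelled out. One common approach can be sketched as follows, assuming a probe-to-gene table with columns `ProbeID` and `Symbol` and collapsing multiple probes per gene by their mean. The function `collapse_probes`, the column names, and the averaging rule are illustrative assumptions rather than the method actually used in this project.

```r
# Collapse a probe-level expression matrix to gene level.
# expr: numeric matrix, rows = probe IDs, columns = samples
# map:  data.frame with columns ProbeID and Symbol (assumed layout)
collapse_probes <- function(expr, map) {
  # Drop probes without a gene symbol or without expression values
  map <- map[!is.na(map$Symbol) & map$Symbol != "", ]
  map <- map[map$ProbeID %in% rownames(expr), ]
  expr <- expr[as.character(map$ProbeID), , drop = FALSE]
  # rowsum() adds up the rows that share a gene symbol; dividing by the
  # number of probes per gene turns those sums into means.
  sums <- rowsum(expr, group = map$Symbol)
  n_probes <- as.vector(table(map$Symbol)[rownames(sums)])
  sums / n_probes
}

# Toy data (hypothetical values, for illustration only)
expr <- matrix(c(2, 4, 6, 8,   # sample s1
                 1, 3, 5, 7),  # sample s2
               nrow = 4,
               dimnames = list(c("p1", "p2", "p3", "p4"), c("s1", "s2")))
map <- data.frame(ProbeID = c("p1", "p2", "p3", "p4"),
                  Symbol  = c("TP53", "TP53", "ALB", ""))

gene_expr <- collapse_probes(expr, map)
gene_expr  # rows are gene symbols; p4 is dropped because it has no symbol
```

Other collapsing rules, such as keeping the probe with the highest mean expression, are also widely used; the choice should match the downstream analysis.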

The following files were obtained:

GSE14520/
├── GPL3921.soft.gz
├── GPL571.soft.gz
├── GSE14520-GPL3921_series_matrix.txt.gz
├── GSE14520-GPL571_series_matrix.txt.gz
├── GSE14520_Extra_Supplement.txt
├── GSE14520_process
│   ├── GPL571_probe2gene_table.tsv
│   ├── GSE14520_GeneExprSet.RDS
│   ├── GSE14520_clinical_origin.csv
│   ├── GSE14520_clinical_post.csv
│   ├── GSE14520_profile_origin.tsv
│   └── GSE14520_profile_post.tsv
├── download.R
└── work.sh

6.5 Downloading GSE149614

GSE149614 is a single-cell RNA-seq dataset. The downloaded files are:

GSE149614_scRNA/
├── 41467_2022_32283_MOESM1_ESM.pdf
├── 41467_2022_32283_MOESM3_ESM.zip
├── A single-cell atlas of the multicellular ecosystem of primary and metastatic hepatocellular carcinoma.pdf
├── GSE149614_HCC.metadata.updated.txt
├── GSE149614_HCC.scRNAseq.S71915.count.txt
├── GSE149614_HCC.scRNAseq.S71915.count.txt.gz
├── GSE149614_HCC.scRNAseq.S71915.normalized.txt.gz
└── SupplementaryData
    ├── Supplementary Data 1.xlsx
    ├── Supplementary Data 10.xlsx
    ├── Supplementary Data 11.xlsx
    ├── Supplementary Data 2.xlsx
    ├── Supplementary Data 3.xlsx
    ├── Supplementary Data 4.xlsx
    ├── Supplementary Data 5.xlsx
    ├── Supplementary Data 6.xlsx
    ├── Supplementary Data 7.xlsx
    ├── Supplementary Data 8.xlsx
    └── Supplementary Data 9.xlsx

6.6 Downloading Other Data

The following files were obtained:

LIRI-JP/
├── donor.LIRI-JP.tsv
├── exp_seq.LIRI-JP.tsv
├── sample.LIRI-JP.tsv
└── specimen.LIRI-JP.tsv

TCGA_LIHC/
├── TCGA-LIHC.htseq_counts.tsv
├── TCGA-LIHC.htseq_fpkm.tsv
└── TCGA-LIHC_clinical_origin.csv
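
ICGC `exp_seq` files are typically distributed in long format, with one row per specimen and gene, so they usually need to be reshaped into a gene-by-sample matrix before analysis. A minimal sketch follows, assuming columns named `icgc_specimen_id`, `gene_id`, and `normalized_read_count`. These column names and the helper `long_to_matrix` are assumptions and should be checked against the header of `exp_seq.LIRI-JP.tsv`.

```r
# Reshape a long-format expression table into a gene x sample matrix.
# Column names are assumptions; adjust them to the actual file header.
long_to_matrix <- function(df,
                           gene_col = "gene_id",
                           sample_col = "icgc_specimen_id",
                           value_col = "normalized_read_count") {
  stopifnot(all(c(gene_col, sample_col, value_col) %in% colnames(df)))
  # tapply() builds the gene x sample table; duplicated gene/sample pairs
  # are averaged, and missing combinations become NA.
  tapply(df[[value_col]],
         list(df[[gene_col]], df[[sample_col]]),
         mean)
}

# Toy data (hypothetical values, for illustration only)
exp_long <- data.frame(
  icgc_specimen_id = c("SP1", "SP1", "SP2", "SP2"),
  gene_id = c("ALB", "TP53", "ALB", "TP53"),
  normalized_read_count = c(0.9, 0.1, 0.7, 0.3)
)

expr_mat <- long_to_matrix(exp_long)
expr_mat  # rows are genes, columns are specimens
```

For the real data, `exp_long` would be read from `LIRI-JP/exp_seq.LIRI-JP.tsv`, for example with `read.delim()`.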