19  LASSO LR

采用了LASSO(Least Absolute Shrinkage and Selection Operator)结合LR(Logistic Regression:适合本教程二分类分组数据)的方法,对差异基因(参考 章节 11) 进行了特征筛选。通过这种方法,能够从大量的候选基因中识别出与特定生物过程或疾病状态最为相关的关键基因。筛选出的这些重要基因不仅具有统计学上的显著性,而且在实际应用中具有较高的预测能力。这些经过LASSO+LR筛选的重要基因将被用于后续的生物信息学分析、疾病机制探究以及潜在治疗靶点的识别等研究中,以期进一步推动相关领域的科学进展。

19.1 加载R包

使用rm(list = ls())来清空环境中的所有变量。

library(tidyverse)
library(Biobase)
library(glmnet)
library(caret)
library(ROCR)
library(pROC)
library(MLmetrics)

rm(list = ls())
options(stringsAsFactors = F)
options(future.globals.maxSize = 10000 * 1024^2)

grp_names <- c("Early Stage", "Late Stage")
grp_colors <- c("#8AC786", "#B897CA")
grp_shapes <- c(15, 16)

19.2 导入数据

da_res <- read.csv("./data/result/DA/HCC_Early_vs_Late_limma_select.csv")

ExprSet <- readRDS("./data/result/ExpSetObject/MergeExpSet_VoomSNM_VoomSNM_LIRI-JP_TCGA-LIHC.RDS")

19.3 准备数据


signif_DEG <- da_res %>%
  dplyr::filter(Enrichment != "Nonsignif") %>%
  dplyr::slice(-grep("\\.", FeatureID))

head(signif_DEG[, 1:6])
  • 基因表达谱:行名是样本,列名是基因ID

profile <- exprs(ExprSet) %>%
  as.data.frame()
rownames(profile) <- make.names(rownames(profile))

head(profile_DEG[, 1:6])
  • 临床表型表:包含分组等信息

metadata <- pData(ExprSet)
head(metadata[, 1:6])

19.4 机器学习特征筛选

特征筛选步骤:

  1. 数据分割:

将原始数据集划分为训练集、验证集和测试集。通常,训练集用于模型训练,验证集用于调整超参数和选择最佳模型,测试集用于评估最终模型的性能。 划分比例可以根据数据集的大小和特性进行调整,但一般常见的比例是训练集占60%-80%,验证集和测试集各占10%-20%。

  1. 数据转换:

数据清洗:处理缺失值、异常值、重复值等。对于缺失值,可以使用均值、中位数、众数等填充,或采用插值法、机器学习预测等方法进行填充。

特征编码:对于分类变量,需要进行编码以便模型能够处理。常见的编码方式包括独热编码(One-Hot Encoding)、标签编码(Label Encoding)等。

特征离散化:将连续特征转换为离散类别,例如通过分箱操作。这有助于处理一些具有非线性关系的特征。

特征标准化或归一化:对于不同量纲或不同分布的特征,需要进行标准化或归一化,以便模型能够更好地处理它们。

  1. 特征筛选:

相关性分析:计算每个特征与目标变量之间的相关性,如皮尔逊相关系数或斯皮尔曼等级相关系数。通过相关性分析,可以初步筛选出与目标变量强相关的特征。

树模型重要性:利用决策树、随机森林等树模型评估特征的重要性。这些模型可以根据特征在树中分裂的贡献度来评估特征的重要性。

启发式搜索策略:如前向序列选择方法,从空的候选集合出发,逐步添加与目标变量相关性最强的特征。

全局最优搜索策略:如穷举法或分支界定法,从所有可能的特征组合中挑选出表现最优的特征子集。

  1. 训练集构建模型:使用经过特征筛选的训练集数据构建机器学习模型。选择合适的机器学习算法,如逻辑回归、支持向量机、决策树、随机森林、神经网络等。

  2. 调参:使用验证集对模型进行调参,优化模型的超参数。超参数是机器学习算法中需要人为设定的参数,如学习率、迭代次数、正则化参数等。 可以采用网格搜索(Grid Search)、随机搜索(Random Search)等方法进行调参。

  3. 测试集评估模型:使用测试集对经过调参的模型进行评估,计算模型的性能指标,如准确率、精确率、召回率、F1分数、AUC-ROC等。根据评估结果选择性能最佳的模型作为最终模型。

19.4.1 数据分割

在机器学习的实践中,数据分割是一个至关重要的步骤,用于将原始数据集划分为训练集和测试集(使用了caret::createDataPartition (Kuhn 2008))。


set.seed(123)

trainIndex <- caret::createDataPartition(
          metadata$Group, 
          p = 0.8, 
          list = FALSE, 
          times = 1)

y_train <- metadata[trainIndex, ]
X_train <- profile_DEG[rownames(profile_DEG) %in% rownames(y_train), ] %>%
  as.matrix()

y_test <- metadata[-trainIndex, ]
X_test <- profile_DEG[!rownames(profile_DEG) %in% rownames(y_train), ] %>%
  as.matrix()

19.4.2 调参

在构建机器学习模型时,特别是在使用Lasso+LR时,选择合适的正则化参数(在Lasso中通常称为\(\lambda\))是非常重要的。正则化参数用于控制模型的复杂度,以避免过拟合。交叉验证(Cross-Validation, CV)是一种常用的技术,用于评估模型在不同参数设置下的性能,并选取最优的参数。

cv.glmnet函数中(来自于glmnet(Friedman, Hastie, 和 Tibshirani 2010)),family参数用于指定模型的类型。当family = “binomial”时,你正在指定模型应该使用逻辑回归(logistic regression:LR)来拟合二元(binary)响应变量(即,响应变量只取两个可能的值,如0和1)。


set.seed(123)

cv.fit <- cv.glmnet(
  x = X_train, 
  y = y_train$Group, 
  family = "binomial")

plot(cv.fit)

19.4.3 测试集验证

首先,将训练好的模型应用于测试集数据,以预测每个样本的分类标签。预测结果将与测试集的真实标签进行对比,以计算模型在各个类别上的分类准确性。

  • 预测测试集的标签

test_group_label <- predict(cv.fit, X_test) %>%
  as.data.frame() %>%
  setNames("Predicted") %>%
  cbind(data.frame(Actual = y_test$Group)) %>%
  head()

head(test_group_label)
  • assess.glmnet函数给出整体评估指标

assess.glmnet(cv.fit, newx = X_test, newy = y_test$Group, s = "lambda.min")
  • 混淆矩阵(Confusion matrix)

cnf <- confusion.glmnet(
  object = cv.fit, 
  newx = X_test)

print(cnf)
  • ROC曲线

rocs <- roc.glmnet(
  object = cv.fit$fit.preval,
  newy = y_train$Group)

best <- cv.fit$index["min", ]
plot(rocs[[best]], type = "l")
invisible(sapply(rocs, lines, col="grey"))
lines(rocs[[best]], lwd = 2, col = "red")
lines(c(0, 1), c(0, 1), col = "black", lty = 9, lwd = 2)

19.4.4 最终分类模型


lasso_model <- glmnet(
  X_train, 
  y_train$Group, 
  family = "binomial", 
  alpha = 1, 
  lambda = cv.fit$lambda.min)

ggplot(roc_df, aes(x = FPR, y = TPR)) +
  geom_line(color = "blue") +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "red") +
  geom_text(aes(label = paste("Youden Index =", round(youden_index, 2))),
            x = 0.2, y = 0.8, color = "black", size = 4) +
  geom_text(aes(label = paste("Specificity =", round(roc_data$specificity[which.max(roc_data$sensitivity)], 2))),
            x = 0.2, y = 0.7, color = "black", size = 4) +
  labs(x = "False Positive Rate", y = "True Positive Rate",
       title = "ROC Curve for LASSO Regression") +
  theme_bw()
  • 混淆矩阵
print(caret::confusionMatrix(as.factor(pred_raw[, 1]), factor(y_test$Group)))

结果:特异度(Specificity)在上述所有模型评估指标中是偏低的(Specificity : 0.2857),结合先前机器学习基础 小节 4.3,可以推测得知该模型对阴性样本也即是Late Stage的患者预测效果不佳。


AUROC <- function(
    DataTest, 
    PredProb = pred_prob, 
    nfeature) {
  
  # DataTest = y_test
  # PredProb = pred_prob
  # nfeature = 41
  
  # ROC object
  rocobj <- roc(DataTest$Group, PredProb[, 1])
  
  # plot
  pl <- ggplot(data = roc, aes(x = fpr, y = tpr)) +
    geom_path(color = "red", size = 1) +
    geom_abline(intercept = 0, slope = 1, 
                color = "grey", linewidth = 1, linetype = 2) +
    labs(x = "False Positive Rate (1 - Specificity)",
         y = "True Positive Rate",
         title = paste0("AUROC (", nfeature, " Features)")) +
    annotate("text", 
             x = 1 - rocbj_df$specificities[max_value_row] + 0.15, 
             y = rocbj_df$sensitivities[max_value_row] - 0.05, 
             label = paste0(threshold, " (", 
                            rocbj_df$specificities[max_value_row], ",",
                            rocbj_df$sensitivities[max_value_row], ")"),
             size = 5, family = "serif") +
    annotate("point", 
             x = 1 - rocbj_df$specificities[max_value_row], 
             y = rocbj_df$sensitivities[max_value_row], 
             color = "black", size = 2) +    
    annotate("text", 
             x = .75, y = .25, 
             label = roc_CI_lab,
             size = 5, family = "serif") +
    coord_cartesian(xlim = c(0, 1), ylim = c(0, 1)) +
    theme_bw() +
    theme(panel.background = element_rect(fill = "transparent"),
          plot.title = element_text(size = 12, color = "black", face = "bold"),
          axis.title = element_text(size = 11, color = "black", face = "bold"), 
          axis.text = element_text(size= 10, color = "black"),
          axis.ticks.length = unit(0.4, "lines"),
          axis.ticks = element_line(color = "black"),
          axis.line = element_line(size = .5, color = "black"),
          text = element_text(size = 8, color = "black", family = "serif"))
  
  res <- list(rocobj = rocobj,
              roc_CI = roc_CI_lab,
              roc_pl = pl)
  
  return(res)
}

AUROC_res <- AUROC(
    DataTest = y_test, 
    PredProb = pred_prob, 
    nfeature = 20)

AUROC_res$roc_pl

AUPRC <- function(DataTest, PredProb, nfeature) {
  
  # plot
  pl <- ggplot(data = prc, aes(x = recall, y = precision)) +
    geom_path(color = "red", size = 1) +
    labs(x = "Recall",
         y = "Precision",
         title = paste0("AUPRC (", nfeature, " Features)")) +
    coord_cartesian(xlim = c(0, 1), ylim = c(0, 1)) +
    theme_bw() +
    theme(panel.background = element_rect(fill = "transparent"),
          plot.title = element_text(color = "black", size = 14, face = "bold"),
          axis.ticks.length = unit(0.4, "lines"),
          axis.ticks = element_line(color = "black"),
          axis.line = element_line(size = .5, color = "black"),
          axis.title = element_text(color = "black", size = 12, face = "bold"),
          axis.text = element_text(color = "black", size = 10),
          text = element_text(size = 8, color = "black", family = "serif"))
  
  res <- list(dat_PR = dat_PR,
              PC_pl = pl)
  
  return(res)
} 

AUPRC_res <- AUPRC(
    DataTest = y_test, 
    PredProb = pred_prob, 
    nfeature = 20)

AUPRC_res$PC_pl

Evaluate_index <- function(DataTest, PredProb, label, PredRaw) {
  
  threshold <- rocbj_df$threshold[max_value_row]
  sen <- round(TP / (TP + FN), 3) # caret::sensitivity(con_matrix)
  spe <- round(TN / (TN + FP), 3) # caret::specificity(con_matrix)
  acc <- round((TP + TN) / (TP + TN + FP + FN), 3) # Accuracy
  pre <- round(TP / (TP + FP), 3) # precision
  rec <- round(TP / (TP + FN), 3) # recall
  #F1S <- round(2 * TP / (TP + TN + FP + FN + TP - TN), 3)# F1-Score
  F1S <- round(2 * TP / (2 * TP + FP + FN), 3)# F1-Score
  youden <- sen + spe - 1 # youden index
  
  
  # AUROC
  AUROC <- round(as.numeric(auc(DataTest$Group, PredProb[, 1])), 3)
  
  # AUPRC
  AUPRC <- round(MLmetrics::PRAUC(y_pred = PredProb[, 1], 
                                  y_true = DataTest$Group), 3)  
  
  index_df <- data.frame(Index = c("Threshold", "Sensitivity",
                                   "Specificity", "Accuracy",
                                   "Precision", "Recall",
                                   "F1 Score", "Youden index",
                                   "AUROC", "AUPRC"),
                         Value = c(threshold, sen, spe,
                                   acc, pre, rec, F1S, 
                                   youden, AUROC, AUPRC)) %>%
    stats::setNames(c("Index", label))
  
  return(index_df)
}

Evaluate_index(
  DataTest = y_test,
  PredProb = pred_prob,
  label = group_names[1],
  PredRaw = pred_raw)

结果:从表中我们可以看到以下指标及其对应的数值:

  • 阈值(Threshold):0.337。阈值用于将模型的预测结果转换为具体的类别(例如,在二分类问题中,通常将预测概率大于阈值的样本视为正类,小于阈值的视为负类)。

  • 灵敏度(Sensitivity):0.937。也称为真正率(True Positive Rate, TPR)或召回率(Recall),它表示所有实际为正样本的样本中,被模型正确预测为正样本的比例。

  • 特异度(Specificity):0.286。也称为真负率(True Negative Rate, TNR),它表示所有实际为负样本的样本中,被模型正确预测为负样本的比例。

  • 准确率(Accuracy):0.737。它表示模型在所有样本上的正确分类的比例。

  • 精度(Precision):0.747。它表示所有被模型预测为正样本的样本中,实际为正样本的比例。

  • 召回率(Recall):0.937(注意这个与Sensitivity的值是一样的,因为在二分类问题中,Sensitivity和Recall是等价的)。

  • F1得分(F1Score):0.831。它是精度和召回率的调和平均值,用于综合衡量两者的表现。

  • Youden指数:0.223。它是一个用于评估诊断测试性能的指标,等于灵敏度与特异度之和减去1。

  • AUROC(Area Under the Receiver Operating Characteristic Curve):0.732。AUROC是一种衡量分类模型性能的指标,它表示ROC曲线下的面积。

  • AUPRC(Area Under the Precision-Recall Curve):0.573。AUPRC是另一种衡量分类模型性能的指标,特别适用于正样本数量远少于负样本的情况,它表示Precision-Recall曲线下的面积。

19.5 标记基因

在运用Lasso回归进行模型训练的过程中,通过L1正则化的作用,模型能够自动筛选出那些对预测结果具有显著影响的特征,这些特征对应的Lasso系数不为零,因此被称为标记基因或显著特征。


optimal_feature <- coef(cv.fit, s = "lambda.min") %>%
  as.matrix() %>%
  as.data.frame() %>%
  setNames("Coef") %>%
  dplyr::filter(Coef != 0) %>%
  tibble::rownames_to_column("FeatureID") %>%
  dplyr::slice(-1)

head(optimal_feature)
  • 可视化特征

main_theme <- 
  theme(
    panel.background = element_blank(),
    panel.grid = element_blank(),
    axis.line.x = element_line(linewidth = 0.5, color = "black"),
    axis.line.y = element_line(linewidth = 0.5, color = "black"),
    axis.ticks = element_line(color = "black"),
    axis.title = element_text(size = 11, color = "black", face = "bold"),    
    axis.text = element_text(color = "black", size = 10),
    legend.position = "right",
    legend.background = element_blank(),
    legend.key = element_blank(),
    legend.text = element_text(size = 12),
    text = element_text(family = "serif"))

optimal_feature_pl <- optimal_feature %>%
  ggplot(aes(x = FeatureID, y = Coef))+
    geom_bar(stat = "identity", fill = "white", color = "blue") +
    labs(x = "", y = "Coefficient") +
    coord_flip() +
    main_theme

optimal_feature_pl
  • 提取特征的表达谱

profile_LASSO <- profile[pmatch(optimal_feature$FeatureID,
                              rownames(profile)), ]

head(profile_LASSO[, 1:6])

19.6 输出结果


if (!dir.exists("./data/result/ML/LASSO")) {
  dir.create("./data/result/ML/LASSO", recursive = TRUE)
}

write.csv(optimal_feature, "./data/result/ML/LASSO/HCC_LASSO_feature.csv", row.names = F)
write.table(profile_LASSO, "./data/result/ML/LASSO/HCC_LASSO_profile.tsv", row.names = T, sep = "\t", quote = F)
save(AUROC_res, file = "./data/result/ML/LASSO/HCC_LASSO_AUROC.RData")

if (!dir.exists("./data/result/Figure/")) {
  dir.create("./data/result/Figure/", recursive = TRUE)
}

pdf("./data/result/Figure/SFig3-A.pdf", width = 5, height = 4)
plot(cv.fit)
dev.off()

ggsave("./data/result/Figure/SFig3-B.pdf", AUROC_res$roc_pl, width = 5, height = 4, dpi = 600) 
ggsave("./data/result/Figure/SFig3-C.pdf", optimal_feature_pl, width = 5, height = 4, dpi = 600) 

19.7 总结

采用了一种结合差异基因分析和机器学习(LASSO+LR模型)的方法,来进行基因筛选和分类预测。

系统信息
sessionInfo()
R version 4.3.3 (2024-02-29)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Sonoma 14.2

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Asia/Shanghai
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
 [1] MLmetrics_1.1.3     pROC_1.18.5         ROCR_1.0-11        
 [4] caret_6.0-94        lattice_0.22-6      glmnet_4.1-8       
 [7] Matrix_1.6-5        Biobase_2.62.0      BiocGenerics_0.48.1
[10] lubridate_1.9.3     forcats_1.0.0       stringr_1.5.1      
[13] dplyr_1.1.4         purrr_1.0.2         readr_2.1.5        
[16] tidyr_1.3.1         tibble_3.2.1        ggplot2_3.5.1      
[19] tidyverse_2.0.0    

loaded via a namespace (and not attached):
 [1] gtable_0.3.5         shape_1.4.6.1        xfun_0.43           
 [4] htmlwidgets_1.6.4    recipes_1.0.10       tzdb_0.4.0          
 [7] vctrs_0.6.5          tools_4.3.3          generics_0.1.3      
[10] stats4_4.3.3         parallel_4.3.3       fansi_1.0.6         
[13] ModelMetrics_1.2.2.2 pkgconfig_2.0.3      data.table_1.15.4   
[16] lifecycle_1.0.4      compiler_4.3.3       munsell_0.5.1       
[19] codetools_0.2-19     htmltools_0.5.8.1    class_7.3-22        
[22] yaml_2.3.8           prodlim_2023.08.28   pillar_1.9.0        
[25] MASS_7.3-60.0.1      gower_1.0.1          iterators_1.0.14    
[28] rpart_4.1.23         foreach_1.5.2        parallelly_1.37.1   
[31] lava_1.8.0           nlme_3.1-164         tidyselect_1.2.1    
[34] digest_0.6.35        future_1.33.2        stringi_1.8.4       
[37] reshape2_1.4.4       listenv_0.9.1        splines_4.3.3       
[40] fastmap_1.1.1        grid_4.3.3           colorspace_2.1-0    
[43] cli_3.6.2            magrittr_2.0.3       survival_3.7-0      
[46] utf8_1.2.4           future.apply_1.11.2  withr_3.0.0         
[49] scales_1.3.0         timechange_0.3.0     rmarkdown_2.26      
[52] globals_0.16.3       nnet_7.3-19          timeDate_4032.109   
[55] hms_1.1.3            evaluate_0.23        knitr_1.46          
[58] hardhat_1.3.1        rlang_1.1.3          Rcpp_1.0.12         
[61] glue_1.7.0           BiocManager_1.30.23  renv_1.0.0          
[64] ipred_0.9-14         rstudioapi_0.16.0    jsonlite_1.8.8      
[67] R6_2.5.1             plyr_1.8.9