22 Intersecting Features
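In data analysis and machine learning projects, feature selection is a crucial step that helps identify the features in a dataset most relevant to the target variable. When important features have been selected by several machine learning methods, intersecting them to identify the core features is an effective strategy: it keeps the focus on features that show a significant effect across different models.
Next, we intersect the important features selected by the three methods, that is, we find the features that all three feature selection methods deem important. This set is what we call the "core features".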
22.1 Load R packages
Use rm(list = ls()) to remove all objects from the current environment before starting.
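The package-loading code itself is not shown in this section. Judging from the attached packages listed in the session information at the end of the chapter (tidyverse, data.table, ggvenn, UpSetR), a minimal sketch might look like the following. The availability check is an added convenience and an assumption, not part of the original.

```r
# Clear the workspace, as described above
rm(list = ls())

# Packages listed under "other attached packages" in the session information
pkgs <- c("tidyverse", "data.table", "ggvenn", "UpSetR")

# Find packages that are not installed (requireNamespace() returns FALSE for those)
missing_pkgs <- pkgs[!vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)]
if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
}

# Attach the packages that are available
for (p in setdiff(pkgs, missing_pkgs)) {
  library(p, character.only = TRUE)
}
```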
22.2 Import data
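The import code and file names are not shown in this section. The later steps only require three tables, LASSO_feature, RF_feature and SVM_feature, each with a FeatureID column. As a minimal sketch of that pattern, the example below writes a toy table to a temporary file and reads it back; the gene names and the file location are hypothetical stand-ins, not the chapter's real data.

```r
# Toy stand-in for one method's result table; the gene names are made up
toy_lasso <- data.frame(
  FeatureID = c("GeneA", "GeneB", "GeneC"),
  stringsAsFactors = FALSE
)

# Hypothetical file location (a temporary file, not the chapter's real path)
toy_path <- tempfile(fileext = ".csv")
write.csv(toy_lasso, toy_path, row.names = FALSE)

# Read the table back and check that the column used later is present
LASSO_feature <- read.csv(toy_path, stringsAsFactors = FALSE)
stopifnot("FeatureID" %in% colnames(LASSO_feature))

head(LASSO_feature)
```

RF_feature and SVM_feature would be read the same way from their own files.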
22.3 Overlapping important features
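After selecting features with three different machine learning methods, take the intersection of the resulting feature sets. Genes that fall within this intersection are regarded as features of significant importance, with a shared, key influence on the problem under study.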
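# Named list of the features selected by each method
feature_list <- list(
  LASSO = LASSO_feature$FeatureID,
  RF_Boruta = RF_feature$FeatureID,
  SVM_RFE = SVM_feature$FeatureID
)

# Pairwise overlaps between methods
over_LASSO_RF <- intersect(LASSO_feature$FeatureID,
                           RF_feature$FeatureID)
over_SVM_RF <- intersect(SVM_feature$FeatureID,
                         RF_feature$FeatureID)
over_LASSO_SVM <- intersect(LASSO_feature$FeatureID,
                            SVM_feature$FeatureID)

# Features selected by all three methods.
# NOTE: df_int_gene is not defined in this section. It is assumed to have a
# column `int` listing, joined by "|", the methods that selected each feature.
over_gene_three <- df_int_gene %>%
  dplyr::filter(int %in% c("LASSO|RF_Boruta|SVM_RFE"))

head(over_gene_three)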
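The code above filters df_int_gene, whose construction is not shown. One way such a membership table could be built in base R is sketched below. This is an illustration under assumptions, not the author's method: the gene names are toy values, and the column names FeatureID and int are chosen to match how the table is used above.

```r
# Toy feature sets standing in for the three methods' results (hypothetical genes)
feature_list <- list(
  LASSO     = c("GeneA", "GeneB", "GeneC", "GeneD"),
  RF_Boruta = c("GeneA", "GeneB", "GeneE"),
  SVM_RFE   = c("GeneA", "GeneC", "GeneE", "GeneF")
)

# Every feature selected by at least one method
all_features <- unique(unlist(feature_list, use.names = FALSE))

# For each feature, join the names of the methods that selected it with "|"
membership <- vapply(all_features, function(f) {
  hits <- names(feature_list)[vapply(feature_list, function(s) f %in% s, logical(1))]
  paste(hits, collapse = "|")
}, character(1))

df_int_gene <- data.frame(
  FeatureID = all_features,
  int = unname(membership),
  stringsAsFactors = FALSE
)

# Features selected by all three methods
over_gene_three <- df_int_gene[df_int_gene$int == "LASSO|RF_Boruta|SVM_RFE", ]

# Cross-check against a direct three-way intersection
core_features <- Reduce(intersect, feature_list)
stopifnot(setequal(over_gene_three$FeatureID, core_features))
```

Because the method names are joined in the order they appear in feature_list, the label for the three-way overlap is always "LASSO|RF_Boruta|SVM_RFE". Reduce(intersect, feature_list) gives the same core set directly when the per-feature labels are not needed.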
22.4 Venn diagram of important features
- Draw the intersection plot with the UpSetR R package (Conway, Lex, and Gehlenborg 2017)
upset_pl <- UpSetR::upset(
  data = UpSetR::fromList(feature_list),
  nsets = 3,
  sets = c("LASSO", "RF_Boruta", "SVM_RFE"),
  sets.bar.color = c("#CD534CFF", "#EFC000FF", "#0073C2FF"))

upset_pl
- Draw the intersection plot with the ggvenn R package (Yan and Yan 2021)
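The ggvenn code is not shown here, although a plot object named venn_pl is saved later in the chapter. A minimal sketch of how such an object could be created follows. The toy feature sets are hypothetical, and the colors and display options are assumptions borrowed from the UpSet plot above rather than the author's settings.

```r
# Toy feature sets (hypothetical genes); in the chapter, feature_list is built in section 22.3
feature_list <- list(
  LASSO     = c("GeneA", "GeneB", "GeneC", "GeneD"),
  RF_Boruta = c("GeneA", "GeneB", "GeneE"),
  SVM_RFE   = c("GeneA", "GeneC", "GeneE", "GeneF")
)

# ggvenn() accepts a named list of sets and returns a ggplot object
venn_pl <- ggvenn::ggvenn(
  feature_list,
  fill_color = c("#CD534CFF", "#EFC000FF", "#0073C2FF"),
  show_percentage = FALSE,  # show counts only
  set_name_size = 4,
  text_size = 4
)

venn_pl
```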
22.5 Save results
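# Create the output directory for the feature table if needed
if (!dir.exists("./data/result/Biomarker/")) {
  dir.create("./data/result/Biomarker/", recursive = TRUE)
}

write.csv(over_gene_three, "./data/result/Biomarker/Biomarker_LR_RF_SVM.csv", row.names = FALSE)

# Create the output directory for figures if needed
if (!dir.exists("./data/result/Figure/")) {
  dir.create("./data/result/Figure/", recursive = TRUE)
}

# print() makes the UpSet plot render even when the code is run via source()
pdf("./data/result/Figure/Fig5-A1.pdf", width = 7, height = 5, onefile = FALSE)
print(upset_pl)
dev.off()

# venn_pl is the ggvenn plot object (its creation is not shown in section 22.4)
ggsave("./data/result/Figure/Fig5-A2.pdf", venn_pl, width = 5, height = 4, dpi = 600)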
22.6 Summary
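This chapter used three different machine learning algorithms to select gene features closely associated with the phenotype or disease state: LASSO+LR, Boruta+RF, and RFE+SVM. Each rests on different statistical principles and assumptions for identifying genes significantly associated with the target variable. The features selected by all three methods were then taken as the core features.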
Session information
R version 4.3.3 (2024-02-29)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Sonoma 14.2
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: Asia/Shanghai
tzcode source: internal
attached base packages:
[1] grid stats graphics grDevices datasets utils methods
[8] base
other attached packages:
[1] UpSetR_1.4.0 ggvenn_0.1.10 data.table_1.15.4 lubridate_1.9.3
[5] forcats_1.0.0 stringr_1.5.1 dplyr_1.1.4 purrr_1.0.2
[9] readr_2.1.5 tidyr_1.3.1 tibble_3.2.1 ggplot2_3.5.1
[13] tidyverse_2.0.0
loaded via a namespace (and not attached):
[1] gtable_0.3.5 jsonlite_1.8.8 compiler_4.3.3
[4] BiocManager_1.30.23 renv_1.0.0 Rcpp_1.0.12
[7] tidyselect_1.2.1 gridExtra_2.3 scales_1.3.0
[10] yaml_2.3.8 fastmap_1.1.1 plyr_1.8.9
[13] R6_2.5.1 generics_0.1.3 knitr_1.46
[16] htmlwidgets_1.6.4 munsell_0.5.1 tzdb_0.4.0
[19] pillar_1.9.0 rlang_1.1.3 utf8_1.2.4
[22] stringi_1.8.4 xfun_0.43 timechange_0.3.0
[25] cli_3.6.2 withr_3.0.0 magrittr_2.0.3
[28] digest_0.6.35 rstudioapi_0.16.0 hms_1.1.3
[31] lifecycle_1.0.4 vctrs_0.6.5 evaluate_0.23
[34] glue_1.7.0 fansi_1.0.6 colorspace_2.1-0
[37] rmarkdown_2.26 tools_4.3.3 pkgconfig_2.0.3
[40] htmltools_0.5.8.1