Description
Accurate cancer subtype identification is fundamental to precision medicine and personalized treatment planning. While DNA methylation has emerged as a powerful epigenetic biomarker for cancer subtyping, the scarcity of labeled methylome datasets has limited its practical application. This paper presents meth-SemiCancer, a novel semi-supervised learning framework designed to overcome this scarcity in DNA methylation-based cancer subtype classification.
The approach first pre-trains a neural network on labeled DNA methylation data from The Cancer Genome Atlas (TCGA). It then assigns pseudo-labels to unlabeled methylation datasets drawn from public repositories such as the Gene Expression Omnibus (GEO). By iteratively fine-tuning on both labeled and pseudo-labeled data, meth-SemiCancer improves classification robustness and reduces overfitting.
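The three steps above (pre-train, pseudo-label, fine-tune) follow the general shape of a self-training loop, which can be sketched as below. This is an illustration under stated assumptions, not the authors' implementation: a small NumPy softmax classifier stands in for the neural network, and the function names, the 0.9 confidence threshold, the number of rounds, and the synthetic beta values are all hypothetical.

```python
import numpy as np


def softmax(z):
    """Row-wise softmax with max-subtraction for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def add_bias(X):
    """Append a constant column so the bias is learned as a weight."""
    return np.hstack([X, np.ones((X.shape[0], 1))])


def fit(X, y, n_classes, W=None, epochs=200, lr=0.5):
    """Train a softmax classifier by full-batch gradient descent.

    Passing an existing W continues training from it (fine-tuning).
    """
    Xb = add_bias(X)
    if W is None:
        W = np.zeros((Xb.shape[1], n_classes))
    Y = np.eye(n_classes)[y]  # one-hot targets
    for _ in range(epochs):
        P = softmax(Xb @ W)
        W = W - lr * Xb.T @ (P - Y) / len(Xb)  # cross-entropy gradient step
    return W


def predict_proba(W, X):
    """Class probabilities for each row of X."""
    return softmax(add_bias(X) @ W)


def self_train(X_lab, y_lab, X_unlab, n_classes, rounds=3, threshold=0.9):
    """Generic self-training loop (stand-in for the paper's procedure)."""
    # Step 1: pre-train on the labeled samples only.
    W = fit(X_lab, y_lab, n_classes)
    for _ in range(rounds):
        # Step 2: pseudo-label the unlabeled samples with the current model.
        proba = predict_proba(W, X_unlab)
        pseudo = proba.argmax(axis=1)
        # Assumption: keep only confident pseudo-labels (hypothetical filter).
        keep = proba.max(axis=1) >= threshold
        # Step 3: fine-tune on labeled plus pseudo-labeled samples.
        X_mix = np.vstack([X_lab, X_unlab[keep]])
        y_mix = np.concatenate([y_lab, pseudo[keep]])
        W = fit(X_mix, y_mix, n_classes, W=W)
    return W


def make_toy_data(n_per_class, n_features=5, seed=0):
    """Synthetic beta values in [0, 1]: subtype 0 near 0.2, subtype 1 near 0.8."""
    rng = np.random.default_rng(seed)
    X0 = rng.normal(0.2, 0.05, size=(n_per_class, n_features))
    X1 = rng.normal(0.8, 0.05, size=(n_per_class, n_features))
    X = np.clip(np.vstack([X0, X1]), 0.0, 1.0)
    y = np.repeat([0, 1], n_per_class)
    return X, y


if __name__ == "__main__":
    X_lab, y_lab = make_toy_data(5, seed=0)         # small labeled set
    X_unlab, y_hidden = make_toy_data(100, seed=1)  # labels hidden from training
    W = self_train(X_lab, y_lab, X_unlab, n_classes=2)
    accuracy = (predict_proba(W, X_unlab).argmax(axis=1) == y_hidden).mean()
    print(f"accuracy on the unlabeled pool: {accuracy:.3f}")
```

In this toy setting the two synthetic subtypes are well separated, so the loop mainly shows the data flow. Real methylation arrays have far more features than samples, which is where a confidence filter and regularization would matter.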
The framework is evaluated on six cancer types: breast, colon, glioblastoma, prostate, renal, and thyroid. In the reported experiments, meth-SemiCancer consistently outperforms traditional machine-learning classifiers such as support vector machines (SVM), Random Forest, and k-nearest neighbors (KNN), achieving higher F1-scores and Matthews correlation coefficients (MCC). The study further shows that incorporating unlabeled samples improves model generalization.
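For readers unfamiliar with the two metrics named above, the sketch below computes them from a confusion matrix. It assumes macro-averaged F1 and the standard multiclass generalization of MCC. The description does not state which averaging the paper uses, so treat both choices, along with the function names, as assumptions.

```python
import numpy as np


def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm


def macro_f1(cm):
    """Unweighted mean of per-class F1 = 2TP / (2TP + FP + FN)."""
    tp = np.diag(cm).astype(float)
    # Column sum is TP + FP, row sum is TP + FN, so their sum is 2TP + FP + FN.
    denom = cm.sum(axis=0) + cm.sum(axis=1)
    f1 = np.divide(2 * tp, denom, out=np.zeros_like(tp), where=denom > 0)
    return f1.mean()


def mcc(cm):
    """Multiclass Matthews correlation coefficient from a confusion matrix."""
    t = cm.sum(axis=1).astype(float)  # samples per true class
    p = cm.sum(axis=0).astype(float)  # samples per predicted class
    c = np.trace(cm)                  # correctly classified samples
    s = cm.sum()                      # total samples
    num = c * s - t @ p
    den = np.sqrt((s**2 - p @ p) * (s**2 - t @ t))
    return num / den if den > 0 else 0.0


if __name__ == "__main__":
    # Hypothetical labels for three subtypes, for illustration only.
    y_true = [0, 0, 1, 1, 2, 2]
    y_pred = [0, 0, 1, 2, 2, 2]
    cm = confusion_matrix(y_true, y_pred, n_classes=3)
    print("macro F1:", macro_f1(cm))
    print("MCC:", mcc(cm))
```

MCC uses every cell of the confusion matrix, which makes it a common companion to F1 when subtype classes are imbalanced.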
This paper is relevant to researchers in bioinformatics, computational biology, epigenetics, and medical AI. It offers a practical, extensible methodology for exploiting large unlabeled methylation datasets to improve cancer subtype classification.
