Feature Reduction of Lung Cancer Microarray Data Using Mutual Information Selection and PyCaret-Supported Recursive Feature Elimination
DOI:
https://doi.org/10.11594/nstp.2023.3701
Keywords:
Lung cancer, microarray data, feature reduction, mutual information feature selection, recursive feature elimination, PyCaret
Abstract
Lung cancer remains a leading cause of cancer-related mortality worldwide, and rising pollution levels in Indonesia make improved early detection of lung cancer urgent. One detection method is molecular diagnosis using DNA microarrays, which has been shown to be effective. However, microarray data contain a vast number of features, and this complexity hinders timely and accurate detection. This study seeks to optimize the feature set to improve classification performance. Our approach combines Mutual Information Feature Selection with Recursive Feature Elimination (RFE), using the PyCaret library to train and evaluate machine learning models. The process begins with an initial feature reduction using Mutual Information to improve computational efficiency, followed by training machine learning models with PyCaret. The two best-performing models for each dataset are then used to perform RFE in search of an optimal feature subset. A Support Vector Machine is also used for comparison. The output is three feature subsets, plus a fourth subset that combines the features of the other three. Finally, PyCaret is used again to train machine learning models on all feature subsets. The results show that other models can select fewer features than the Support Vector Machine while maintaining strong predictive performance, with accuracy of 95% to 98%. In conclusion, this research offers a new approach to selecting optimal features for microarray analysis, with implications for more effective and timely cancer diagnosis.
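The two-stage idea described in the abstract (a mutual information filter followed by recursive feature elimination) can be illustrated with a minimal sketch. This is not the authors' pipeline and does not use PyCaret: it relies only on the Python standard library, runs on synthetic data, binarizes each feature at its median before estimating mutual information, and uses a small gradient-descent logistic regression as a stand-in for the models the study selects. The function names, the cut-offs (10 features after filtering, 3 after elimination), and the data dimensions are all assumptions made for illustration.

```python
import math
import random
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (nats) between a median-binarized feature and class labels."""
    median = sorted(xs)[len(xs) // 2]
    bins = [int(x >= median) for x in xs]
    n = len(xs)
    joint, px, py = Counter(zip(bins, ys)), Counter(bins), Counter(ys)
    return sum((c / n) * math.log(c * n / (px[b] * py[y]))
               for (b, y), c in joint.items())


def fit_logistic(X, y, epochs=200, lr=0.1):
    """Batch gradient descent for logistic regression; returns the feature weights."""
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * p, 0.0
        for row, label in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, row))
            z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
            err = 1.0 / (1.0 + math.exp(-z)) - label
            gb += err
            for j, xj in enumerate(row):
                gw[j] += err * xj
        w = [wj - lr * g / n for wj, g in zip(w, gw)]
        b -= lr * gb / n
    return w


def rfe(X, y, features, n_keep):
    """Refit on the surviving features and drop the one with the smallest |weight|."""
    features = list(features)
    while len(features) > n_keep:
        sub = [[row[j] for j in features] for row in X]
        w = fit_logistic(sub, y)
        weakest = min(range(len(features)), key=lambda i: abs(w[i]))
        features.pop(weakest)
    return features


# Synthetic stand-in for microarray data (assumption): 60 samples, 30 features,
# of which only features 0, 1 and 2 are shifted by the class label.
rng = random.Random(0)
n_samples, n_features, informative = 60, 30, [0, 1, 2]
y = [i % 2 for i in range(n_samples)]
X = [[(3.0 * label if j in informative else 0.0) + rng.gauss(0, 1)
      for j in range(n_features)] for label in y]

# Stage 1: mutual information filter, keeping the 10 highest-scoring features.
scores = [mutual_information([row[j] for row in X], y) for j in range(n_features)]
top = sorted(range(n_features), key=lambda j: scores[j], reverse=True)[:10]

# Stage 2: recursive feature elimination on the filtered set, down to 3 features.
selected = sorted(rfe(X, y, top, n_keep=3))
print("after MI filter:", sorted(top))
print("after RFE:", selected)
```

The filter step is cheap because it scores each feature independently, which is why it is applied first. The elimination step refits the model after every removal, so it is affordable only once the candidate set is already small.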
License
Copyright (c) 2023 Andrew Jonathan Brahms Simangunsong, Valha Tsabita Hidayat

This work is licensed under a Creative Commons Attribution 4.0 International License.
Authors who publish with these proceedings agree to the following terms:
Authors retain copyright and grant the Nusantara Science and Technology Proceedings right of first publication, with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgement of the work's authorship and initial publication in these proceedings.
Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the proceedings' published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in these proceedings.
Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work (See the Effect of Open Access).