Non-Negative Matrix Factorization and Latent Semantic Analysis for Hybrid Feature Selection: A Proposed Machine Learning System for the Detection of Malicious Executable Files
Lefoane, Moemedi ; Ghafir, Ibrahim ; ; ; El Hindi, K. ; Mahendran, A.
Publication Date
2025-08-12
Rights
(c) 2025 The Authors. This is an Open Access article distributed under the Creative Commons CC-BY license (https://creativecommons.org/licenses/by/4.0/)
Peer-Reviewed
Yes
Open Access status
openAccess
Accepted for publication
2025-07-29
Abstract
A typical cyberattack lifecycle involves several key phases, including footprinting and reconnaissance, scanning, exploitation, and covering tracks. Successful delivery of a payload, which is typically executed after vulnerabilities have been exploited, is central to the effectiveness of a cyberattack. The payload allows adversaries to gain backdoor access to their target and accomplish their objectives. With the increasing use of generative Artificial Intelligence (AI), adversaries are leveraging AI to enhance their attack strategies, from crafting more credible phishing and social engineering campaigns to developing advanced viruses delivered through channels such as phishing. AI techniques for detecting malicious executable files have garnered significant attention in the research community. Numerous Machine Learning (ML) techniques have been proposed for this purpose, but the memory required to store the extracted features remains a significant challenge. These features, which resemble unstructured vocabulary features in natural language processing, must be converted into a rectangular matrix before they can be fed to a classification model during training. The resulting matrix is sparse, and its size depends on the number of unique features extracted, so memory requirements grow substantially. This research proposes a novel ML-based intrusion detection system for detecting malicious executable files. The proposed system uses Non-Negative Matrix Factorization (NMF) and Latent Semantic Analysis (LSA) individually as feature selection techniques. It also introduces a hybrid feature selection approach that combines NMF and LSA.
The proposed system was evaluated on a dataset of benign and malicious executable files, achieving an accuracy of over 96% and a False Positive Rate (FPR) of less than 2.2% across several ML models.
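The abstract describes turning vocabulary-like features into a sparse rectangular matrix and then reducing it with NMF. The following is a minimal sketch of that idea only, not the authors' implementation: the file names, token lists, rank `k`, and the helper names (`matmul`, `frob_error`, `nmf`) are all hypothetical, and the factorization uses the standard Lee-Seung multiplicative updates in plain Python. It does not cover the LSA or hybrid variants, whose details are not given in this record.

```python
import random

# Hypothetical per-file token lists (e.g. imported API names); not from the paper.
samples = {
    "benign_a.exe": ["CreateFileW", "ReadFile", "CloseHandle"],
    "benign_b.exe": ["CreateFileW", "WriteFile", "CloseHandle"],
    "malware_a.exe": ["VirtualAlloc", "WriteProcessMemory", "CreateRemoteThread"],
    "malware_b.exe": ["VirtualAlloc", "CreateRemoteThread", "CloseHandle"],
}

# One column per unique token, so the matrix widens with every new token seen.
vocab = sorted({tok for toks in samples.values() for tok in toks})
col = {tok: j for j, tok in enumerate(vocab)}

# Files-by-vocabulary count matrix; most entries stay zero.
V = [[0.0] * len(vocab) for _ in samples]
for i, toks in enumerate(samples.values()):
    for tok in toks:
        V[i][col[tok]] += 1.0

nonzero = sum(1 for row in V for x in row if x)
density = nonzero / (len(V) * len(vocab))


def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, c)) for c in zip(*B)] for row in A]


def transpose(A):
    return [list(r) for r in zip(*A)]


def frob_error(V, W, H):
    """Squared Frobenius norm of V - W H."""
    WH = matmul(W, H)
    return sum((v - p) ** 2 for rv, rp in zip(V, WH) for v, p in zip(rv, rp))


def nmf(V, k, iters=200, seed=0, eps=1e-9):
    """Factor V (n x m) into non-negative W (n x k) and H (k x m)."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H), element-wise
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[h * a / (b + eps) for h, a, b in zip(rh, ra, rb)]
             for rh, ra, rb in zip(H, num, den)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[w * a / (b + eps) for w, a, b in zip(rw, ra, rb)]
             for rw, ra, rb in zip(W, num, den)]
    return W, H


# Each file is now described by k = 2 non-negative values instead of len(vocab) counts.
W, H = nmf(V, k=2)
print(f"vocabulary size: {len(vocab)}, density: {density:.2f}")
print(f"reduced shape: {len(W)} x {len(W[0])}")
```

In this sketch the rows of `W` would serve as the compact feature vectors passed to a classifier. A real pipeline would more likely rely on a sparse-matrix library and an established NMF implementation than on hand-written loops.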
Version
Published version
Citation
Lefoane M, Ghafir I, Kabir S et al (2025) Non-Negative Matrix Factorization and Latent Semantic Analysis for Hybrid Feature Selection: A Proposed Machine Learning System for the Detection of Malicious Executable Files. IEEE Access. 13: 138867-138882.
Type
Article
