Contributions to evaluation of machine learning models. Applicability domain of classification models

Publication date: 2019
Author: Rado, Omesaad A.M.
Supervisor: Neagu, Daniel
Keywords: Machine learning; Classification algorithms; Binary classification; Accuracy; Model evaluation; Model reliability; Applicability domain; Model robustness; Model coverage; Healthcare data
Rights: The University of Bradford theses are licensed under a Creative Commons Licence.
Institution: University of Bradford
Department: Faculty of Engineering and Informatics
Awarded: 2019
Abstract
Artificial intelligence (AI) and machine learning (ML) offer application opportunities and challenges that can be framed as learning problems. The performance of a machine learning model depends on both the algorithm and the data: learning algorithms build a model of reality through training and testing on data, and their performance reflects the degree to which the learned model agrees with that reality. ML algorithms have been used successfully in numerous classification problems. With the growing use of ML models for many purposes across domains, the validation of such predictive models is increasingly required in a more formal way. There are many existing studies on model evaluation, robustness, reliability, and the quality of data and data-driven models; however, these studies do not yet consider the concept of the applicability domain (AD). The problem is that the AD is often poorly defined, or not defined at all, in many fields.

This work investigates the robustness of ML classification models from the applicability domain perspective. A standard definition of the applicability domain refers to the region of the input space in which the model provides results with a specified reliability. The main aim of this study is to investigate the connection between the applicability domain approach and classification model performance, and to examine the usefulness of assessing the AD for a classification model in terms of reliability, reuse, and robustness of classifiers. The work is carried out in three approaches: first, assessing the applicability domain of a classification model; second, investigating the robustness of the classification model based on the applicability domain approach; and third, selecting an optimal model using Pareto optimality. The experiments consider different machine learning algorithms for binary and multi-class classification on healthcare datasets from public benchmark data repositories.

In the first approach, the decision tree (DT) algorithm is used for classification, and a feature selection method is applied to choose the features used by the classifier. The resulting classifiers are reused in the third approach for model selection using Pareto optimality. The second approach is implemented in three steps: building the classification model, generating synthetic data, and evaluating the obtained results. The results of the study provide an understanding of how the proposed approach can help define a model's robustness and applicability domain in order to provide reliable outputs, and they open opportunities for classification data and model management.

The proposed algorithms are evaluated through a set of experiments on the classification accuracy of instances that fall within the domain of the model. For the first approach, using all features, the highest accuracy obtained is 0.98 with an average threshold of 0.34 for the Breast Cancer dataset; after applying the recursive feature elimination (RFE) method, the accuracy is 0.96 with an average threshold of 0.27. For the robustness of the classification model based on the applicability domain approach, the minimum accuracy is 0.62 for the Indian Liver Patient dataset at r=0.10, and the maximum accuracy is 0.99 for the Thyroid dataset at r=0.10.
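To make the kind of workflow described above concrete, the following Python sketch (scikit-learn) combines a decision tree classifier with recursive feature elimination, a simple range-based applicability domain defined by the training feature ranges, and a small synthetic-perturbation check. The dataset (scikit-learn's built-in breast cancer data, used as a stand-in for the healthcare benchmarks), the range-based AD definition, and the perturbation radius r are illustrative assumptions, not the actual procedures or thresholds used in the thesis.

# Minimal sketch: decision-tree classifier with RFE feature selection,
# a simple range-based applicability domain (AD), and a synthetic-
# perturbation robustness check. The AD definition (per-feature training
# ranges) and the perturbation radius r are illustrative assumptions.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

# Feature selection: keep the top-k features ranked by RFE with a tree.
selector = RFE(DecisionTreeClassifier(random_state=0), n_features_to_select=10)
selector.fit(X_train, y_train)
X_train_sel = selector.transform(X_train)
X_test_sel = selector.transform(X_test)

clf = DecisionTreeClassifier(max_depth=5, random_state=0)
clf.fit(X_train_sel, y_train)

# Range-based AD: a test instance is "in domain" if every selected feature
# lies within the range seen during training (a common, simple AD choice).
lo, hi = X_train_sel.min(axis=0), X_train_sel.max(axis=0)
in_domain = np.all((X_test_sel >= lo) & (X_test_sel <= hi), axis=1)

print("overall accuracy  :", accuracy_score(y_test, clf.predict(X_test_sel)))
print("AD coverage       :", in_domain.mean())
if in_domain.any():
    print("in-domain accuracy:",
          accuracy_score(y_test[in_domain], clf.predict(X_test_sel[in_domain])))

# Synthetic-data robustness check: perturb each test feature with Gaussian
# noise of r standard deviations and see how far accuracy drops.
rng = np.random.default_rng(0)
scale = X_train_sel.std(axis=0)
for r in (0.05, 0.10, 0.20):          # r is an assumed perturbation scale
    X_noisy = X_test_sel + rng.normal(0.0, r, X_test_sel.shape) * scale
    print(f"accuracy at r={r:.2f}:", accuracy_score(y_test, clf.predict(X_noisy)))

In this sketch, the accuracy restricted to in-domain instances plays the role of the coverage-conditioned accuracy discussed in the abstract.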
For the selection of an optimal model using Pareto optimality, the selected classifier achieves an accuracy of 0.94 with an average threshold of 0.35.

This research investigates critical aspects of the applicability domain in relation to the robustness of ML classification algorithms. The performance of machine learning techniques depends on how reliable the model's predictions are. In the literature, the robustness of an ML model is defined as its ability to keep the testing error close to the training error; this property also describes the stability of the model's performance when it is tested on new datasets. In conclusion, this thesis introduces the concept of an applicability domain for classifiers and tests the use of this concept in case studies on health-related public benchmark datasets.
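The Pareto-based selection step can be sketched as a small bi-criteria filter that keeps only the candidate classifiers not dominated on both objectives. The objectives used here (accuracy and applicability-domain coverage) and the candidate scores are hypothetical placeholders for illustration; the thesis's actual criteria and values may differ.

# Minimal sketch of Pareto-optimal model selection over two criteria.
# The criteria (maximise accuracy, maximise AD coverage) and the candidate
# scores below are illustrative placeholders, not thesis results.
from typing import List, Tuple

Candidate = Tuple[str, float, float]  # (model name, accuracy, coverage)

def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Return the non-dominated candidates (both criteria maximised)."""
    front = []
    for name, acc, cov in candidates:
        dominated = any(
            (a >= acc and c >= cov) and (a > acc or c > cov)
            for n, a, c in candidates
            if n != name
        )
        if not dominated:
            front.append((name, acc, cov))
    return front

candidates = [
    ("dt_depth3",  0.91, 0.95),
    ("dt_depth5",  0.94, 0.88),
    ("dt_depth10", 0.95, 0.70),
    ("dt_stump",   0.85, 0.99),
    ("dt_full",    0.93, 0.65),   # dominated by dt_depth5
]

for model in pareto_front(candidates):
    print("Pareto-optimal:", model)

Any classifier on the resulting front represents a different trade-off between predictive accuracy and the fraction of instances it can reliably cover; the final choice among them is a modelling decision.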
Type: Thesis
Qualification name: PhD
Related items
Showing items related by title, author, creator and subject.
-
Hybrid Dynamic Modelling of Engine Emissions on Multi-Physics Simulation Platform. A Framework Combining Dynamic and Statistical Modelling to Develop Surrogate Models of System of Internal Combustion Engine for Emission Modelling
Campean, Felician; Neagu, Daniel; Pant, Gaurav (University of Bradford, Faculty of Engineering and Informatics, 2018)
The data-driven models used for the design of powertrain controllers are typically based on data obtained from steady-state experiments. However, these models are only valid under stable conditions and do not provide any information on the dynamic behaviour of the system. In order to capture this behaviour, dynamic modelling techniques are intensively studied to generate alternative solutions to the engine mapping and calibration problem, aiming to address the need to increase productivity (reduce development time) and to develop better models of the actual behaviour of the engine under real-world conditions. In this thesis, a dynamic modelling approach is presented for the prediction of NOx emissions for a 2.0 litre Diesel engine, based on a coupled pre-validated virtual Diesel engine model (GT-Suite® 1-D air path model) and in-cylinder combustion model (CMCL® Stochastic Reactor Model Engine Suite). In the context of the considered Engine Simulation Framework, GT-Suite + Stochastic Reactor Model (SRM), one fundamental problem is to establish a real-time stochastic simulation capability. This problem can be addressed by replacing the slow combustion chemistry solver (SRM) with an appropriate NOx surrogate model. The approach taken in this research for the development of this surrogate model was based on a combination of design of dynamic experiments run on the virtual Diesel engine model (GT-Suite), with a dynamic model fitted for the parameters required as input to the SRM, and a zonal design of experiments (DoE), using Optimal Latin Hypercubes (OLH), run on the SRM model. A response surface model was fitted to the NOx predicted from the SRM OLH DoE data. This surrogate NOx model was then used to replace the computationally expensive SRM simulation, enabling real-time simulations of transient drive cycles to be executed. The performance of the approach was validated on a simulated NEDC drive cycle against experimental data collected for the engine case study. The capability of the methodology to capture the transient trends of the system shows promising results and will be used for the development of global surrogate prediction models for engine-out emissions.
-
Interpretation, Identification and Reuse of Models. Theory and algorithms with applications in predictive toxicology
Neagu, Daniel; Ridley, Mick J.; Travis, Kim; Palczewska, Anna Maria (University of Bradford, School of Electrical Engineering and Computer Science, 2015-07-15)
This thesis is concerned with developing methodologies that enable existing models to be effectively reused. Results of this thesis are presented in the framework of Quantitative Structure-Activity Relationship (QSAR) models, but their application is much more general. QSAR models relate chemical structures with their biological, chemical or environmental activity. There are many applications that offer an environment to build and store predictive models. Unfortunately, they do not provide advanced functionalities that allow for efficient model selection and for interpretation of model predictions for new data. This thesis aims to address these issues and proposes methodologies for dealing with three research problems: model governance (management), model identification (selection), and interpretation of model predictions. The combination of these methodologies can be employed to build more efficient systems for model reuse in QSAR modelling and other areas. The first part of this study investigates toxicity data and model formats and reviews some of the existing toxicity systems in the context of model development and reuse. Based on the findings of this review and the principles of data governance, a novel concept of model governance is defined. Model governance comprises model representation and model governance processes. These processes are designed and presented in the context of model management. As an application, minimum information requirements and an XML representation for QSAR models are proposed. Once a collection of validated, accepted and well-annotated models is available within a model governance framework, they can be applied to new data. It may happen that there is more than one model available for the same endpoint. Which one to choose? The second part of this thesis proposes a theoretical framework and algorithms that enable automated identification of the most reliable model for new data from a collection of existing models. The main idea is based on partitioning the search space into groups and assigning a single model to each group. The construction of this partitioning is difficult because it is a bi-criteria problem. The main contribution in this part is the application of Pareto points to the search space partition. The proposed methodology is applied to three endpoints in chemoinformatics and predictive toxicology. After having identified a model for the new data, we would like to know how the model obtained its prediction and how trustworthy it is. An interpretation of model predictions is straightforward for linear models thanks to the availability of model parameters and their statistical significance. For non-linear models this information can be hidden inside the model structure. This thesis proposes an approach for the interpretation of a random forest classification model. This approach allows for the determination of the influence (called feature contribution) of each variable on the model prediction for an individual data instance. In this part, three methods are proposed that allow analysis of feature contributions. Such analysis might lead to the discovery of new patterns that represent a standard behaviour of the model and allow additional assessment of the model reliability for new data. The application of these methods to two standard benchmark datasets from the UCI machine learning repository shows the great potential of this methodology. The algorithm for calculating feature contributions has been implemented and is available as an R package called rfFC.
-
Interpreting random forest models using a feature contribution method
Palczewska, Anna Maria; Palczewski, J.; Marchese-Robinson, R.M.; Neagu, Daniel (2013)
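The related items above describe a feature contribution method for interpreting random forest predictions: each prediction is decomposed into a bias term (the root-node class distribution) plus per-feature contributions accumulated along the decision path. The authors' implementation is the R package rfFC; the Python sketch below is a simplified, independent illustration of the same general idea using scikit-learn tree internals, with a public benchmark dataset as a stand-in, and should not be read as the authors' exact algorithm.

# Minimal sketch of tree-path feature contributions for a random forest:
# prediction = bias (root class distribution) + sum of per-feature
# contributions along the decision path, averaged over trees.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

def tree_contributions(tree, x):
    """Bias and per-feature contributions for one sample and one tree."""
    t = tree.tree_
    # Class distribution at each node (normalised per node).
    node_value = t.value[:, 0, :]
    node_value = node_value / node_value.sum(axis=1, keepdims=True)

    node = 0
    bias = node_value[0]
    contrib = np.zeros((x.shape[0], node_value.shape[1]))
    while t.children_left[node] != -1:        # walk down to a leaf
        feat = t.feature[node]
        if x[feat] <= t.threshold[node]:
            child = t.children_left[node]
        else:
            child = t.children_right[node]
        # The change in class distribution is credited to the split feature.
        contrib[feat] += node_value[child] - node_value[node]
        node = child
    return bias, contrib

X, y = load_breast_cancer(return_X_y=True)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

x = X[0]
bias = np.zeros(forest.n_classes_)
contrib = np.zeros((X.shape[1], forest.n_classes_))
for est in forest.estimators_:
    b, c = tree_contributions(est, x)
    bias += b / len(forest.estimators_)
    contrib += c / len(forest.estimators_)

# Sanity check: bias plus summed contributions reproduces the forest's
# predicted class probabilities for this sample.
print("reconstructed:", bias + contrib.sum(axis=0))
print("predict_proba:", forest.predict_proba(x.reshape(1, -1))[0])

Because the per-node class distributions telescope along each path, the bias plus the summed contributions exactly reconstructs the forest's predicted probabilities, which is what makes the decomposition useful for assessing model reliability on new data.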