Interpreting random forest models using a feature contribution method
Keyword: Vegetation; Computational modelling; Training; Predictive models; Data models; Mathematical model; Analytical models
Version: No full-text in the repository
Citation: Palczewska A, Palczewski J, Robinson RM et al (2013) Interpreting random forest models using a feature contribution method. In: 2013 IEEE 14th International Conference on Information Reuse and Integration (IRI). 4-16 Aug, 2013: 112-119.
Link to publisher’s version: http://dx.doi.org/10.1109/IRI.2013.6642461
Showing items related by title, author, creator and subject.
Hybrid Dynamic Modelling of Engine Emissions on Multi-Physics Simulation Platform. A Framework Combining Dynamic and Statistical Modelling to Develop Surrogate Models of System of Internal Combustion Engine for Emission Modelling
Campean, I. Felician; Neagu, Daniel; Pant, Gaurav (University of Bradford, Faculty of Engineering and Informatics, 2018)
Interpretation, Identification and Reuse of Models. Theory and algorithms with applications in predictive toxicology.
Neagu, Daniel; Ridley, Mick J.; Travis, Kim; Palczewska, Anna Maria (University of Bradford, School of Electrical Engineering and Computer Science, 2015-07-15)
This thesis is concerned with developing methodologies that enable existing models to be effectively reused. Its results are presented in the framework of Quantitative Structure-Activity Relationship (QSAR) models, but their application is much more general. QSAR models relate chemical structures to their biological, chemical or environmental activity. Many applications offer an environment in which to build and store predictive models. Unfortunately, they do not provide advanced functionality for efficient model selection or for interpreting model predictions for new data. This thesis aims to address these issues and proposes methodologies for three research problems: model governance (management), model identification (selection), and interpretation of model predictions. Combined, these methodologies can be used to build more efficient systems for model reuse in QSAR modelling and other areas. The first part of this study investigates toxicity data and model formats and reviews some of the existing toxicity systems in the context of model development and reuse. Based on the findings of this review and the principles of data governance, a novel concept of model governance is defined. Model governance comprises model representation and model governance processes. These processes are designed and presented in the context of model management. As an application, minimum information requirements and an XML representation for QSAR models are proposed. Once a collection of validated, accepted and well-annotated models is available within a model governance framework, they can be applied to new data.
More than one model may be available for the same endpoint, which raises the question of which one to choose. The second part of this thesis proposes a theoretical framework and algorithms that enable automated identification of the most reliable model for new data from the collection of existing models. The main idea is to partition the search space into groups and assign a single model to each group. Constructing this partition is difficult because it is a bi-criteria problem. The main contribution in this part is the application of Pareto points to the partitioning of the search space. The proposed methodology is applied to three endpoints in chemoinformatics and predictive toxicology. Once a model has been identified for the new data, we would like to know how the model obtained its prediction and how trustworthy it is. Interpreting model predictions is straightforward for linear models thanks to the availability of model parameters and their statistical significance. For nonlinear models this information can be hidden inside the model structure. This thesis proposes an approach for interpreting a random forest classification model. The approach determines the influence (called feature contribution) of each variable on the model prediction for an individual data instance. Three methods are proposed in this part for analysing feature contributions. Such analysis might lead to the discovery of new patterns that represent standard behaviour of the model and allow additional assessment of the model's reliability for new data. The application of these methods to two standard benchmark datasets from the UCI machine learning repository shows the great potential of this methodology. The algorithm for calculating feature contributions has been implemented and is available as an R package called rfFC.
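The idea of a feature contribution can be illustrated with a minimal Python sketch. It assumes one common formulation: each tree node stores the fraction of positive-class training instances it contains, every split adds the change in that fraction to the contribution of the split feature, and the forest averages the per-tree results. The Node class, the function names and the two toy trees are hypothetical and written for illustration only; this is not the rfFC implementation.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Node:
    """Hypothetical tree node; value is the positive-class fraction in the node."""
    value: float
    feature: Optional[int] = None   # index of the split feature, None for a leaf
    threshold: float = 0.0
    left: Optional["Node"] = None   # branch taken when x[feature] <= threshold
    right: Optional["Node"] = None


def tree_contributions(root: Node, x: List[float], n_features: int) -> List[float]:
    """Sum the change in node value along the path of x, per split feature."""
    contrib = [0.0] * n_features
    node = root
    while node.feature is not None:
        child = node.left if x[node.feature] <= node.threshold else node.right
        contrib[node.feature] += child.value - node.value
        node = child
    return contrib


def forest_contributions(
    trees: List[Node], x: List[float], n_features: int
) -> Tuple[float, List[float]]:
    """Average root values (the baseline) and contributions over all trees."""
    n_trees = len(trees)
    baseline = sum(tree.value for tree in trees) / n_trees
    total = [0.0] * n_features
    for tree in trees:
        for i, c in enumerate(tree_contributions(tree, x, n_features)):
            total[i] += c / n_trees
    return baseline, total


# Two hand-built toy trees over two features (illustrative values only).
tree_a = Node(0.5, feature=0, threshold=1.0,
              left=Node(0.25),
              right=Node(0.75, feature=1, threshold=2.0,
                         left=Node(0.5), right=Node(1.0)))
tree_b = Node(0.5, feature=1, threshold=3.0,
              left=Node(0.25), right=Node(0.75))

baseline, contrib = forest_contributions([tree_a, tree_b], [2.0, 2.5], 2)
prediction = baseline + sum(contrib)  # equals the mean of the reached leaf values
```

Under these assumptions the baseline plus the summed contributions reproduces the forest's predicted positive-class fraction for the instance, so each contribution can be read as the share of the prediction attributed to one variable.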
Design and Operation of Multistage Flash (MSF) Desalination: Advanced Control Strategies and Impact of Fouling. Design operation and control of multistage flash desalination processes: dynamic modelling of fouling, effect of non-condensable gases on venting system design and implementation of GMC and fuzzy control
Mujtaba, Iqbal M.; Alsadaie, Salih M.M. (University of Bradford, Faculty of Engineering and Informatics, 2017)
The rapid increase in demand for fresh water, driven by world population growth and the scarcity of natural water, puts pressure on the desalination industry to install more desalination plants around the world. Among desalination technologies, the multistage flash (MSF) process is considered the most reliable technique for producing potable water from saline water. In recent years, however, the MSF process has faced many problems in cutting costs and improving performance. Among these problems are non-condensable gases (NCGs) and the accumulation of fouling, both of which act as thermal insulation. As a result, the MSF pumps and heat transfer equipment are overdesigned, which increases capital cost and decreases plant performance. Moreover, improved process control is a cost-effective approach to energy conservation and increased process profitability. This study is therefore motivated by the absence of a detailed kinetic fouling model and of an implementation of advanced process control (APC). To address these gaps, commercial modelling tools can be used to model and simulate the MSF process, taking into account NCGs, the effect of fouling, and the optimum control strategy. In this research, the gPROMS (general PROcess Modeling System) model builder has been used to develop the MSF process model. First, a dynamic mathematical model of MSF is developed based on the basic laws of mass balance, energy balance and heat transfer.
Physical and thermodynamic properties of brine, distillate and water vapour are included to support the model. The simulation results are validated against actual plant data published in the literature, and good agreement with these data is obtained. Second, the design of the venting system in the MSF plant and the effect of NCGs on the overall heat transfer coefficient (OHTC) are studied. The release rate of NCGs is studied using Henry’s law and the locations of venting points are optimised. The results reveal that a high concentration of NCGs heavily affects the OHTC. Furthermore, an advanced control strategy, generic model control (GMC), is designed and introduced to the MSF process to control and track the set points of the two most important variables in the MSF plant: the Top Brine Temperature (TBT), which is the output temperature of the brine heater, and the Brine Level (BL) in the last stage. The results are compared with those of a conventional Proportional Integral Derivative (PID) controller and show that the GMC controller handles the nonlinear system better than the conventional PID controller. In addition, a new control strategy called hybrid Fuzzy-GMC is developed and implemented to control the same two loops. Its results reveal that the new controller outperforms pure GMC in some areas. Finally, a dynamic fouling model is developed and incorporated into the MSF dynamic process model to predict fouling at high temperature and high velocity. The proposed dynamic model considers the attachment and removal mechanisms of calcium carbonate and magnesium hydroxide under less restrictive assumptions. Since the MSF plant stages work as a series of heat exchangers, the temperature, heat flux and salinity of the seawater change continuously. The proposed model predicts the behaviour of fouling based on the physical and thermal conditions of every single stage of the plant.