Interpretation, Identification and Reuse of Models. Theory and algorithms with applications in predictive toxicology.
Author: Palczewska, Anna Maria
Ridley, Mick J.
Keyword: Model interpretation, Model identification, Model governance, Reuse of models, Predictive toxicology, Random forest model, Feature contributions, Pareto optimality
The University of Bradford theses are licenced under a Creative Commons Licence.
Institution: University of Bradford
Department: School of Electrical Engineering and Computer Science
Abstract: This thesis is concerned with developing methodologies that enable existing models to be effectively reused. Results of this thesis are presented in the framework of Quantitative Structure-Activity Relationship (QSAR) models, but their application is much more general. QSAR models relate chemical structures to their biological, chemical or environmental activity. There are many applications that offer an environment to build and store predictive models. Unfortunately, they do not provide advanced functionalities that allow for efficient model selection and for interpretation of model predictions for new data. This thesis aims to address these issues and proposes methodologies for dealing with three research problems: model governance (management), model identification (selection), and interpretation of model predictions. The combination of these methodologies can be employed to build more efficient systems for model reuse in QSAR modelling and other areas. The first part of this study investigates toxicity data and model formats and reviews some of the existing toxicity systems in the context of model development and reuse. Based on the findings of this review and the principles of data governance, a novel concept of model governance is defined. Model governance comprises model representation and model governance processes. These processes are designed and presented in the context of model management. As an application, minimum information requirements and an XML representation for QSAR models are proposed. Once a collection of validated, accepted and well-annotated models is available within a model governance framework, they can be applied to new data. It may happen that more than one model is available for the same endpoint. Which one should be chosen? The second part of this thesis proposes a theoretical framework and algorithms that enable automated identification of the most reliable model for new data from the collection of existing models.
The main idea is based on partitioning the search space into groups and assigning a single model to each group. Constructing this partition is difficult because it is a bi-criteria problem. The main contribution in this part is the application of Pareto points to the partitioning of the search space. The proposed methodology is applied to three endpoints in chemoinformatics and predictive toxicology. Once a model has been identified for the new data, we would like to know how the model obtained its prediction and how trustworthy it is. Interpretation of model predictions is straightforward for linear models thanks to the availability of model parameters and their statistical significance. For nonlinear models this information can be hidden inside the model structure. This thesis proposes an approach for the interpretation of a random forest classification model. This approach allows the influence of each variable on the model prediction (called its feature contribution) to be determined for an individual data instance. In this part, three methods are proposed for the analysis of feature contributions. Such analysis might lead to the discovery of new patterns that represent the standard behaviour of the model and allow additional assessment of the model's reliability for new data. The application of these methods to two standard benchmark datasets from the UCI machine learning repository shows the potential of this methodology. The algorithm for calculating feature contributions has been implemented and is available as an R package called rfFC.
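The notion of Pareto points used in the second part of the abstract can be illustrated with a minimal Python sketch. It is not the thesis's algorithm: the abstract does not name the two criteria, so the pairs below (a hypothetical prediction error and a hypothetical number of groups, both to be minimised) and the function name pareto_front are assumptions made purely for illustration. The sketch only shows how non-dominated candidates are selected when two criteria conflict.

```python
from typing import List, Tuple

# A candidate is scored on two criteria, both to be minimised.
# The meaning of the criteria is an assumption for illustration only.
Point = Tuple[float, float]


def pareto_front(points: List[Point]) -> List[Point]:
    """Return the candidates that no other candidate dominates.

    A candidate q dominates p when q is no worse than p on both
    criteria and differs from p (so it is better on at least one).
    """
    front: List[Point] = []
    best_second = float("inf")
    # Sort by the first criterion (ties broken by the second), then sweep:
    # a point survives only if it strictly improves the second criterion
    # over everything seen so far.
    for p in sorted(set(points)):
        if p[1] < best_second:
            front.append(p)
            best_second = p[1]
    return front


# Hypothetical candidates: (prediction error, number of groups).
candidates = [(0.10, 5), (0.12, 3), (0.11, 6), (0.20, 2), (0.25, 2), (0.12, 4)]
front = pareto_front(candidates)
print(front)
```

Every candidate left out of the result is matched or beaten on both criteria by some candidate in it, so a final choice among the remaining points is a trade-off between the two criteria rather than a question of one option being plainly better.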
Showing items related by title, author, creator and subject.
Large-scale 3D environmental modelling and visualisation for flood hazard warning. Wan, Tao Ruan; Palmer, Ian J.; Wang, Chen (University of Bradford, Department of Creative Technology, School of Computing, Informatics and Media, 2009-08-24)
3D environment reconstruction has received great interest in recent years in areas such as city planning, virtual tourism and flood hazard warning. With the rapid development of computer technologies, it has become possible and necessary to develop new methodologies and techniques for real-time simulation for virtual environment applications. This thesis proposes a novel dynamic simulation scheme for flood hazard warning. The work consists of three main parts: digital terrain modelling; 3D environmental reconstruction and system development; flood simulation models. The digital terrain model is constructed using real-world GIS measurement data, in terms of digital elevation data and satellite image data. An NTSP algorithm is proposed for very large data assessing, terrain modelling and visualisation. A pyramidal data arrangement structure is used for dealing with the requirements of terrain details with different resolutions. The 3D environmental reconstruction system is made up of environmental image segmentation for object identification, a new shape match method and an intelligent reconstruction system. The active contours-based multi-resolution vector-valued framework and the multi-seed region growing method are both used for extracting necessary objects from images. The shape match method is used with a template in the spatial domain for detailed small-scale 3D urban environment reconstruction. The intelligent reconstruction system is designed to recreate the whole model based on specific features of objects for large-scale environment reconstruction. This study then proposes a new flood simulation scheme, which is an important application of the 3D environmental reconstruction system.
Two new flooding models have been developed. The first is a flood spreading model, which is useful for large-scale flood simulation. It consists of flooding image spatial segmentation, a water level calculation process, a standard gradient descent method for energy minimization, a flood region search and a merge process. The second is a finite volume hydrodynamic model built from the shallow water equations, which is useful for urban area flood simulation. The proposed 3D urban environment reconstruction system was tested on our simulation platform. The experimental results indicate that this method is capable of dealing with complicated and high-resolution region reconstruction, which is useful for many applications. When the 3D flood simulation system was tested, the simulation results were very close to the real flood situation, and the method simulated the inundation area faster and more accurately than conventional flood simulation models.
Modelling and stochastic simulation of synthetic biological Boolean gates. Sanassy, D.; Fellerman, H.; Krasnogor, N.; Konur, Savas; Mierla, L.M.; Gheorghe, Marian; Ladroue, C.; Kalvala, S. (2014)
Synthetic Biology aspires to design, compose and engineer biological systems that implement specified behaviour. When designing such systems, hypothesis testing via computational modelling and simulation is vital in order to reduce the need for costly wet lab experiments. As a case study, we discuss the use of computational modelling and stochastic simulation for engineered genetic circuits that implement Boolean AND and OR gates that have been reported in the literature. We present performance analysis results for nine different state-of-the-art stochastic simulation algorithms and analyse the dynamic behaviour of the proposed gates. Stochastic simulations verify the desired functioning of the proposed gate designs.