    • Computational Approaches for Time Series Analysis and Prediction. Data-Driven Methods for Pseudo-Periodical Sequences.

      Neagu, Daniel; Lan, Yang (University of Bradford, School of Computing, Informatics & Media, 2010-05-28)
      Time series data mining is one branch of data mining. Time series analysis and prediction have always played an important role in human activities and natural sciences. A pseudo-periodical time series has a complex structure, with the fluctuations and frequencies of the time series changing over time. This pseudo-periodicity brings new properties and challenges to time series analysis and prediction. This thesis proposes two original computational approaches for time series analysis and prediction: Moving Average of nth-order Difference (MANoD) and Series Features Extraction (SFE). Based on data-driven methods, the two original approaches open new insights into time series analysis and prediction, contributing new feature detection techniques. The proposed algorithms can reveal hidden patterns based on the characteristics of time series, and they can be applied to predict forthcoming events. This thesis also presents the evaluation results of the proposed algorithms on various pseudo-periodical time series, and compares the prediction results with those of classical time series prediction methods. The results of the original approaches applied to real-world and synthetic time series are very good and show that the contributions open promising research directions.
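      The exact MANoD formulation is given in the thesis itself; purely as an illustration, the Python sketch below shows the general idea of taking the nth-order difference of a series and smoothing it with a moving average, so that peaks in the smoothed difference hint at forthcoming events. The difference order and window size used here are assumptions, not parameters taken from the thesis.

      # Illustrative sketch only: order n and window size are assumed values.
      from typing import List

      def nth_order_difference(series: List[float], n: int) -> List[float]:
          """Apply first-order differencing n times."""
          diff = list(series)
          for _ in range(n):
              diff = [b - a for a, b in zip(diff, diff[1:])]
          return diff

      def moving_average(series: List[float], window: int) -> List[float]:
          """Simple trailing moving average."""
          return [sum(series[i - window + 1:i + 1]) / window
                  for i in range(window - 1, len(series))]

      def manod(series: List[float], n: int = 2, window: int = 5) -> List[float]:
          """Moving average of the nth-order difference of a series."""
          return moving_average(nth_order_difference(series, n), window)

      if __name__ == "__main__":
          import math
          signal = [math.sin(0.3 * t) + 0.05 * t for t in range(100)]
          print(manod(signal, n=2, window=5)[:10])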
    • Enhancing association rules algorithms for mining distributed databases. Integration of fast BitTable and multi-agent association rules mining in distributed medical databases for decision support.

      Dahal, Keshav P.; Abdo, Walid A.A. (University of Bradford, Department of Computing, 2013-11-04)
      Over the past few years, mining data located in heterogeneous and geographically distributed sites has been designated as one of the key important issues. Loading distributed data into a centralized location for mining interesting rules is not a good approach, because it raises issues such as data privacy and imposes network overheads. The situation becomes worse when the network has limited bandwidth, which is the case in most real-time systems. This has prompted the need for intelligent data analysis to discover the hidden information in these huge amounts of distributed databases. In this research, we present an incremental approach for building an efficient Multi-Agent based algorithm for mining real-world databases in geographically distributed sites. First, we propose the Distributed Multi-Agent Association Rules algorithm (DMAAR) to minimize the all-to-all broadcasting between distributed sites. Analytical calculations show that DMAAR reduces the algorithm complexity and minimizes the message communication cost. The proposed Multi-Agent based algorithm complies with the Foundation for Intelligent Physical Agents (FIPA), which is considered the global standard for communication between agents, thus enabling the proposed algorithm's agents to cooperate with other standard agents. Second, the BitTable Multi-Agent Association Rules algorithm (BMAAR) is proposed. BMAAR includes an efficient BitTable data structure which helps to compress the database so that it can easily fit into the memory of the local sites. It also includes two BitWise AND/OR operations for quick candidate itemset generation and support counting. Moreover, the algorithm includes three transaction trimming techniques to reduce the size of the mined data. Third, we propose the Pruning Multi-Agent Association Rules algorithm (PMAAR), which includes three candidate itemset pruning techniques for reducing the large number of generated candidate itemsets and, consequently, the total time for the mining process. The proposed PMAAR algorithm has been compared with existing Association Rules algorithms against different benchmark datasets and has proved to have better performance and execution time. Moreover, PMAAR has been implemented on real-world distributed medical databases obtained from more than one hospital in Egypt to discover the hidden Association Rules in patients' records and to demonstrate the merits and capabilities of the proposed model further. Medical data was obtained anonymously, without the patients' personal details. The analysis helped to identify the existence or absence of the disease based on a minimum number of effective examinations and tests. Thus, the proposed algorithm can help in providing accurate medical decisions based on cost-effective treatments, improving the medical service for patients, reducing the real-time response of the health system and improving the quality of clinical decision making.
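      As a rough illustration of the BitTable idea described above (not the BMAAR/PMAAR implementation itself), the Python sketch below stores each item as a bit vector over transactions, so that support counting for a candidate itemset reduces to a bitwise AND followed by a bit count. The toy database and the minimum support value are assumptions.

      from itertools import combinations
      from typing import Dict, FrozenSet, List, Set

      def build_bittable(transactions: List[Set[str]]) -> Dict[str, int]:
          """Map each item to an integer whose set bits mark the transactions containing it."""
          table: Dict[str, int] = {}
          for tid, items in enumerate(transactions):
              for item in items:
                  table[item] = table.get(item, 0) | (1 << tid)
          return table

      def support(itemset: FrozenSet[str], table: Dict[str, int]) -> int:
          """Support of an itemset = popcount of the AND of its items' bit vectors."""
          items = list(itemset)
          bits = table[items[0]]
          for item in items[1:]:
              bits &= table[item]
          return bin(bits).count("1")

      if __name__ == "__main__":
          db = [{"a", "b", "c"}, {"a", "c"}, {"a", "d"}, {"b", "c", "e"}]
          table = build_bittable(db)
          for pair in combinations(sorted(table), 2):
              s = support(frozenset(pair), table)
              if s >= 2:                      # assumed minimum support
                  print(pair, s)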
    • Enhancing Fuzzy Associative Rule Mining Approaches for Improving Prediction Accuracy. Integration of Fuzzy Clustering, Apriori and Multiple Support Approaches to Develop an Associative Classification Rule Base

      Dahal, Keshav P.; Hossain, M. Alamgir; Sowan, Bilal I. (University of Bradford, School of Computing, Informatics & Media, 2012-02-13)
      Building an accurate and reliable prediction model for different application domains is one of the most significant challenges in knowledge discovery and data mining. This thesis focuses on building and enhancing a generic predictive model for estimating a future value by extracting association rules (knowledge) from a quantitative database. This model is applied to several data sets obtained from different benchmark problems, and the results are evaluated through extensive experimental tests. The thesis presents an incremental development process for the prediction model with three stages. Firstly, a Knowledge Discovery (KD) model is proposed by integrating Fuzzy C-Means (FCM) with the Apriori approach to extract Fuzzy Association Rules (FARs) from a database for building a Knowledge Base (KB) to predict a future value. The KD model has been tested with two road-traffic data sets. Secondly, the initial model has been further developed by including a diversification method in order to improve the reliability of the FARs and to find the best and most representative rules. The resulting Diverse Fuzzy Rule Base (DFRB) maintains high-quality and diverse FARs, offering a more reliable and generic model. The model uses FCM to transform quantitative data into fuzzy data, while a Multiple Support Apriori (MSapriori) algorithm is adapted to extract the FARs from the fuzzy data. The correlation values for these FARs are calculated, and an efficient orientation for filtering FARs is performed as a post-processing method. The diversity of the FARs is maintained through the clustering of FARs, based on the sharing function technique used in multi-objective optimization. The best and most diverse FARs are obtained as the DFRB to utilise within the Fuzzy Inference System (FIS) for prediction. The third stage of development proposes a hybrid prediction model called the Fuzzy Associative Classification Rule Mining (FACRM) model. This model integrates the improved Gustafson-Kessel (G-K) algorithm, the proposed Fuzzy Associative Classification Rules (FACR) algorithm and the proposed diversification method. The improved G-K algorithm transforms quantitative data into fuzzy data, while the FACR algorithm generates significant rules (Fuzzy Classification Association Rules (FCARs)) by employing the improved multiple support threshold, associative classification and vertical scanning format approaches. These FCARs are then filtered by calculating the correlation value and the distance between them. The advantage of the proposed FACRM model is that it builds a generalized prediction model able to deal with different application domains. The validation of the FACRM model is conducted using different benchmark data sets from the University of California, Irvine (UCI) machine learning and KEEL (Knowledge Extraction based on Evolutionary Learning) repositories, and the results of the proposed FACRM are also compared with other existing prediction models. The experimental results show that the error rate and generalization performance of the proposed model are better than those of the commonly used models in the majority of data sets. A new method for feature selection entitled Weighting Feature Selection (WFS) is also proposed. The WFS method aims to improve the performance of the FACRM model. The prediction performance is improved by minimizing the prediction error and reducing the number of generated rules. The prediction results of FACRM employing WFS have been compared with those of the FACRM and Stepwise Regression (SR) models for different data sets. The performance analysis and comparative study show that the proposed prediction model provides an effective approach that can be used within a decision support system.
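      As a minimal sketch of the fuzzy-support computation underlying such FAR models: quantitative values are mapped to membership degrees in linguistic terms, and the fuzzy support of an itemset is the sum over records of the minimum membership of its items. The triangular membership functions below stand in for the fuzzy sets that FCM (or the improved G-K algorithm) would learn, and the road-traffic-style attribute names and break points are hypothetical.

      from typing import Dict, List, Tuple

      def triangular(x: float, a: float, b: float, c: float) -> float:
          """Triangular membership function with peak at b."""
          if x <= a or x >= c:
              return 0.0
          return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

      # Linguistic terms per attribute (hypothetical break points).
      TERMS = {
          ("flow", "low"):   lambda v: triangular(v, -1, 0, 50),
          ("flow", "high"):  lambda v: triangular(v, 40, 100, 160),
          ("speed", "slow"): lambda v: triangular(v, -1, 0, 40),
          ("speed", "fast"): lambda v: triangular(v, 30, 70, 120),
      }

      def fuzzy_support(itemset: List[Tuple[str, str]],
                        records: List[Dict[str, float]]) -> float:
          """Sum over records of the minimum membership of the fuzzy items."""
          return sum(min(TERMS[item](rec[item[0]]) for item in itemset)
                     for rec in records)

      if __name__ == "__main__":
          data = [{"flow": 80, "speed": 25}, {"flow": 90, "speed": 30},
                  {"flow": 10, "speed": 65}]
          # Fuzzy support of the itemset {flow is high, speed is slow}.
          print(round(fuzzy_support([("flow", "high"), ("speed", "slow")], data), 3))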
    • A new model for worm detection and response. Development and evaluation of a new model based on knowledge discovery and data mining techniques to detect and respond to worm infection by integrating incident response, security metrics and apoptosis.

      Cullen, Andrea J.; Woodward, Mike E.; Mohd Saudi, Madihah (University of Bradford, Department of Computing, School of Computing, Informatics and Media, 2012-04-17)
      Worms have been improved and a range of sophisticated techniques have been integrated, which make the detection and response processes much harder and longer than in the past. Therefore, in this thesis, a STAKCERT (Starter Kit for Computer Emergency Response Team) model is built to detect worm attacks in order to respond to worms more efficiently. The novelty and strengths of the STAKCERT model lie in the method implemented, which consists of the STAKCERT KDD processes and the development of the STAKCERT worm classification, the STAKCERT relational model and the STAKCERT worm apoptosis algorithm. The new concept introduced in this model, named apoptosis, is borrowed from the human immune system and has been mapped to a security perspective. Furthermore, the encouraging results achieved by this research are validated by applying security metrics for assigning the weight and severity values that trigger the apoptosis. In order to optimise the performance results, the standard operating procedures (SOP) for worm incident response, which involve static and dynamic analyses, the knowledge discovery (KDD) techniques in modeling the STAKCERT model, and data mining algorithms were used. The STAKCERT model has produced encouraging results and outperformed comparative existing work for worm detection. It produces an overall accuracy rate of 98.75%, with a 0.2% false positive rate and a 1.45% false negative rate. Worm response has resulted in an accuracy rate of 98.08%, which can later be used by other researchers as a comparison with their work.
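      The actual metrics and weights are defined in the thesis; purely as an illustration of how weighted severity values could trigger an apoptosis-style response, the sketch below sums weight times severity over the worm indicators observed for an incident and isolates the host once an assumed threshold is crossed. The indicator names, weights, severities and threshold are all hypothetical.

      # Hypothetical indicator table: name -> (weight, severity).
      INDICATORS = {
          "mass_mailing":     (0.30, 0.9),
          "registry_autorun": (0.25, 0.7),
          "port_scanning":    (0.25, 0.8),
          "payload_dropped":  (0.20, 1.0),
      }
      APOPTOSIS_THRESHOLD = 0.5   # assumed trigger level

      def apoptosis_score(observed: set) -> float:
          """Weighted severity of the indicators observed for one incident."""
          return sum(w * s for name, (w, s) in INDICATORS.items() if name in observed)

      def respond(observed: set) -> str:
          """Trigger the apoptosis-style response once the score crosses the threshold."""
          if apoptosis_score(observed) >= APOPTOSIS_THRESHOLD:
              return "apoptosis: isolate host and terminate infected process"
          return "continue monitoring"

      if __name__ == "__main__":
          print(respond({"mass_mailing", "port_scanning"}))                      # below threshold
          print(respond({"mass_mailing", "port_scanning", "payload_dropped"}))   # triggers apoptosis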
    • Phishing website detection using intelligent data mining techniques. Design and development of an intelligent association classification mining fuzzy based scheme for phishing website detection with an emphasis on E-banking.

      Hossain, M. Alamgir; Dahal, Keshav P.; Thabtah, Fadi; Abur-rous, Maher Ragheb Mohammed (University of Bradford, Department of Computing, 2011-05-06)
      Phishing techniques have grown not only in number, but also in sophistication. Phishers have many approaches and tactics for conducting a well-designed phishing attack. The targets of phishing attacks, mainly on-line banking consumers and payment service providers, face substantial financial loss and a lack of trust in Internet-based services. In order to overcome these, there is an urgent need to find solutions to combat phishing attacks. Detecting phishing websites is a complex task which requires significant expert knowledge and experience. So far, various solutions have been proposed and developed to address these problems. Most of these approaches are not able to decide dynamically whether a site is in fact a phishing site, giving rise to a large number of false positives. This is mainly due to limitations of the previously proposed approaches, for example depending only on fixed black and white listing databases, the absence of human intelligence and expertise, poor scalability and their timeliness. In this research we investigated and developed the application of an intelligent fuzzy-based classification system for e-banking phishing website detection. The main aim of the proposed system is to protect users from phishers' deception tricks, giving them the ability to detect the legitimacy of websites. The proposed intelligent phishing detection system employs a Fuzzy Logic (FL) model with association classification mining algorithms. The approach combines the capabilities of fuzzy reasoning in measuring imprecise and dynamic phishing features with the capability to classify the phishing fuzzy rules. Different phishing experiments, covering phishing attacks, motivations and deception behaviour techniques, have been conducted to address all phishing concerns. A layered fuzzy structure has been constructed for all gathered and extracted phishing website features and patterns. These have been divided into six criteria and distributed across three layers, based on their attack type. To reduce human knowledge intervention, different classification and association algorithms have been implemented to generate fuzzy phishing rules automatically, to be integrated inside the fuzzy inference engine for the final phishing detection. Experimental results demonstrated the ability of the learning approach to identify all relevant fuzzy rules from the training data set. A comparative study and analysis showed that the proposed learning approach has a higher degree of predictive and detective capability than existing models. Experiments also showed the significance of some important phishing criteria, such as URL & Domain Identity and Security & Encryption, to the final phishing detection rate. Finally, our proposed intelligent phishing website detection system was developed, tested and validated by incorporating the scheme as a web-based plug-in phishing toolbar. The results obtained are promising and show that our intelligent fuzzy-based classification detection system can provide effective help for real-time phishing website detection. The toolbar successfully recognized and detected approximately 92% of the phishing websites selected from our test data set, avoiding many misclassified websites and false phishing alarms.
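      As an illustration of fuzzy scoring in the spirit of the layered criteria mentioned above (URL & Domain Identity, Security & Encryption), the sketch below fuzzifies a few website features, combines them with min as the fuzzy AND in two hand-written rules, and aggregates the fired rules with max into a phishing-risk score. The features, membership functions and rules are illustrative assumptions, not the thesis's rule base.

      def tri(x: float, a: float, b: float, c: float) -> float:
          """Triangular membership function with peak at b."""
          if x <= a or x >= c:
              return 0.0
          return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

      def phishing_risk(url_length: int, domain_age_days: int, uses_https: bool) -> float:
          long_url   = tri(url_length, 40, 120, 300)     # URL & Domain Identity (assumed feature)
          young_site = tri(domain_age_days, -1, 0, 180)  # newly registered domain (assumed feature)
          no_tls     = 0.0 if uses_https else 1.0        # Security & Encryption (assumed feature)

          rules = [
              min(long_url, young_site),   # IF URL is long AND domain is young THEN phishy
              min(young_site, no_tls),     # IF domain is young AND no TLS THEN phishy
          ]
          return max(rules)                # aggregate fired rules (fuzzy OR)

      if __name__ == "__main__":
          print(phishing_risk(url_length=150, domain_age_days=20, uses_https=False))
          print(phishing_risk(url_length=35, domain_age_days=2000, uses_https=True))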
    • Social Data Mining for Crime Intelligence: Contributions to Social Data Quality Assessment and Prediction Methods

      Neagu, Daniel; Trundle, Paul R.; Isah, Haruna (University of Bradford, Department of Computer Science, 2017)
      With the advancement of the Internet and related technologies, many traditional crimes have made the leap to digital environments. The successes of data mining in a wide variety of disciplines have given birth to crime analysis. Traditional crime analysis is mainly focused on understanding crime patterns; however, it is unsuitable for identifying and monitoring emerging crimes. The true nature of crime remains buried in unstructured content that represents the hidden story behind the data. User feedback leaves valuable traces that can be utilised to measure the quality of various aspects of products or services and can also be used to detect, infer, or predict crimes. As with any application of data mining, the data must be of a high quality standard in order to avoid erroneous conclusions. This thesis presents a methodology and practical experiments towards discovering whether (i) user feedback can be harnessed and processed for crime intelligence, (ii) criminal associations, structures, and roles can be inferred among entities involved in a crime, and (iii) methods and standards can be developed for measuring, predicting, and comparing the quality level of social data instances and samples. It contributes to the theory, design and development of a novel framework for crime intelligence and an algorithm for the estimation of social data quality by innovatively adapting methods for monitoring water contaminants. Several experiments were conducted and the results obtained revealed the significance of this study in mining social data for crime intelligence and in developing social data quality filters and decision support systems.
    • Strategy and methodology for enterprise data warehouse development. Integrating data mining and social networking techniques for identifying different communities within the data warehouse.

      Ridley, Mick J.; Alhajj, R.; Rifaie, Mohammad (University of Bradford, School of Computing, Informatics and Media, 2010-08-19)
      Data warehouse technology has been successfully integrated into the information infrastructure of major organizations as a potential solution for eliminating redundancy and providing comprehensive data integration. Realizing the importance of a data warehouse as the main data repository within an organization, this dissertation addresses different aspects related to data warehouse architecture and performance issues. Many data warehouse architectures have been presented by industry analysts and research organizations. These architectures vary from independent, physical, business-unit-centric data marts to the centralised two-tier hub-and-spoke data warehouse. The operational data store is a third tier which was offered later to address the business requirements for inter-day data loading. While the industry-available architectures are all valid, I found them to be suboptimal in efficiency (cost) and effectiveness (productivity). In this dissertation, I am advocating a new architecture (the Hybrid Architecture) which encompasses the industry-advocated architecture. The Hybrid Architecture demands the acquisition, loading and consolidation of enterprise atomic and detailed data into a single integrated enterprise data store (the Enterprise Data Warehouse), where business-unit-centric Data Marts and Operational Data Stores (ODS) are built in the same instance of the Enterprise Data Warehouse. For the purpose of highlighting the role of data warehouses for different applications, we describe an effort to develop a data warehouse for a geographical information system (GIS). We further study the importance of data practices, quality and governance for financial institutions by commenting on the RBC Financial Group case. The development and deployment of the Enterprise Data Warehouse based on the Hybrid Architecture spawned its own issues and challenges. Organic data growth and business requirements to load additional new data will significantly increase the amount of stored data. Consequently, the number of users will increase significantly. Enterprise data warehouse obesity, performance degradation and navigation difficulties are chief amongst these issues and challenges. Association rules mining and social networks have been adopted in this thesis to address the above-mentioned issues and challenges. We describe an approach that uses frequent pattern mining and social network techniques to discover different communities within the data warehouse. These communities include sets of tables frequently accessed together, sets of tables retrieved together most of the time and sets of attributes that mostly appear together in the queries. We concentrate on tables in the discussion; however, the model is general enough to discover other communities. We first build a frequent pattern mining model by considering each query as a transaction and the tables as items. Then, we mine closed frequent itemsets of tables; these itemsets include tables that are mostly accessed together and hence should be treated as one unit in storage and retrieval for better overall performance. We utilize social network construction and analysis to find maximum-sized sets of related tables; this is a more robust approach than taking a union of overlapping itemsets. We derive the Jaccard distance between the closed itemsets and construct the social network of tables by adding links that represent distance above a given threshold.
The constructed network is analyzed to discover communities of tables that are mostly accessed together. The reported test results are promising and demonstrate the applicability and effectiveness of the developed approach.
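      As a minimal sketch of the approach described above: each query is treated as a transaction whose items are the tables it touches, frequent itemsets of tables are mined from the query log, itemsets are linked by Jaccard similarity and the connected components are read as communities of tables accessed together. The brute-force enumeration, thresholds and toy query log below are assumptions; the thesis mines closed frequent itemsets and applies social network analysis to the resulting graph.

      from itertools import combinations
      from typing import FrozenSet, List, Set

      def frequent_itemsets(queries: List[Set[str]], min_support: int,
                            max_size: int = 3) -> List[FrozenSet[str]]:
          """Brute-force enumeration of frequent table sets in the query log."""
          tables = sorted(set().union(*queries))
          result = []
          for k in range(2, max_size + 1):
              for cand in combinations(tables, k):
                  support = sum(1 for q in queries if set(cand) <= q)
                  if support >= min_support:
                      result.append(frozenset(cand))
          return result

      def jaccard(a: FrozenSet[str], b: FrozenSet[str]) -> float:
          return len(a & b) / len(a | b)

      def table_communities(itemsets: List[FrozenSet[str]],
                            min_similarity: float = 0.3) -> List[Set[str]]:
          """Union-find over itemsets linked by Jaccard similarity."""
          parent = list(range(len(itemsets)))
          def find(i):
              while parent[i] != i:
                  parent[i] = parent[parent[i]]
                  i = parent[i]
              return i
          for i, j in combinations(range(len(itemsets)), 2):
              if jaccard(itemsets[i], itemsets[j]) >= min_similarity:
                  parent[find(i)] = find(j)
          groups = {}
          for i, s in enumerate(itemsets):
              groups.setdefault(find(i), set()).update(s)
          return list(groups.values())

      if __name__ == "__main__":
          log = [{"orders", "customers"}, {"orders", "customers", "products"},
                 {"orders", "products"}, {"branches", "staff"}, {"branches", "staff"}]
          itemsets = frequent_itemsets(log, min_support=2)
          print(table_communities(itemsets))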