• Exploring Methods for Comparing Similarity of Dimensionally Inconsistent Multivariate Numerical Data

      Micic, Natasha; Neagu, Daniel; Torgunov, Denis; Campean, I. Felician (2018-06-28)
      When developing multivariate data classification and clustering methodologies for data mining, it is clear that most literature contributions only really consider data that contain consistently the same attributes. There are however many cases in current big data analytics applications where for same topic and even same source data sets there are differing attributes being measured, for a multitude of reasons (whether the specific design of an experiment or poor data quality and consistency). We define this class of data a dimensionally inconsistent multivariate data, a topic that can be considered a subclass of the Big Data Variety research. This paper explores some classification methodologies commonly used in multivariate classification and clustering tasks and considers how these traditional methodologies could be adapted to compare dimensionally inconsistent data sets. The study focuses on adapting two similarity measures: Robinson-Foulds tree distance metrics and Variation of Information; for comparing clustering of hierarchical cluster algorithms (such clusters are derived from the raw multivariate data). The results from experiments on engineering data highlight that adapting pairwise measures to exclude non-common attributes from the traditional distance metrics may not be the best method of classification. We suggest that more specialised metrics of similarity are required to address challenges presented by dimensionally inconsistent multivariate data, with specific applications for big engineering data analytics.
    • Proximity curves for potential-based clustering

      Csenki, Attila; Neagu, Daniel; Torgunov, Denis; Micic, Natasha (2020)
      The concept of proximity curve and a new algorithm are proposed for obtaining clusters in a finite set of data points in the finite dimensional Euclidean space. Each point is endowed with a potential constructed by means of a multi-dimensional Cauchy density, contributing to an overall anisotropic potential function. Guided by the steepest descent algorithm, the data points are successively visited and removed one by one, and at each stage the overall potential is updated and the magnitude of its local gradient is calculated. The result is a finite sequence of tuples, the proximity curve, whose pattern is analysed to give rise to a deterministic clustering. The finite set of all such proximity curves in conjunction with a simulation study of their distribution results in a probabilistic clustering represented by a distribution on the set of dendrograms. A two-dimensional synthetic data set is used to illustrate the proposed potential-based clustering idea. It is shown that the results achieved are plausible since both the ‘geographic distribution’ of data points as well as the ‘topographic features’ imposed by the potential function are well reflected in the suggested clustering. Experiments using the Iris data set are conducted for validation purposes on classification and clustering benchmark data. The results are consistent with the proposed theoretical framework and data properties, and open new approaches and applications to consider data processing from different perspectives and interpret data attributes contribution to patterns.