Document Type : Original Research Paper
Authors
1 Department of Electrical Engineering, PhD student, University of Birjand, i.behravan@birjand.ac.ir
2 Department of Electrical Engineering, Faculty of Engineering, University of Birjand, hzahiri@birjand.ac.ir
3 Department of Electrical Engineering, Faculty of Engineering, University of Birjand,
4 KDD lab, ISTI-CNR, Pisa, Italy, roberto.trasarti@isti.cnr.it
Abstract
Background and Objectives: Big data refers to huge datasets with a large number of objects and a large number of dimensions. Mining big datasets and extracting knowledge from them is beyond the capability of conventional data mining algorithms, including clustering algorithms, classification algorithms, and feature selection methods.
Methods: Clustering is the process of dividing the data points of a dataset into groups (clusters) based on their similarities and dissimilarities. It is an unsupervised learning method that discovers useful information and hidden patterns in raw data. In this research, a new clustering method for big datasets is introduced, based on the Particle Swarm Optimization (PSO) algorithm. The proposed method is a two-stage algorithm: it first searches the solution space for the proper number of clusters and then searches for the positions of the centroids.
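The second stage described above, searching for centroid positions once the number of clusters is known, can be illustrated with a generic PSO clustering sketch. This is not the authors' implementation. The function names (`sse`, `pso_centroids`), the fitness function (sum of squared distances to the nearest centroid), and the parameter values (`w`, `c1`, `c2`, swarm size, iteration count) are all assumptions chosen for illustration, and the first stage (finding the number of clusters) is not covered.

```python
import random


def sse(centroids, points):
    """Sum of squared distances from each point to its nearest centroid."""
    total = 0.0
    for p in points:
        total += min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids)
    return total


def pso_centroids(points, k, n_particles=20, n_iters=100,
                  w=0.7, c1=1.5, c2=1.5, seed=0):
    """Generic PSO search for k centroids (illustrative, assumed settings).

    Each particle is a flat list holding the coordinates of k centroids.
    """
    rng = random.Random(seed)
    dim = len(points[0])
    size = k * dim

    def decode(x):
        # Split the flat position vector into k centroids of length dim.
        return [x[i * dim:(i + 1) * dim] for i in range(k)]

    # Initialise every particle from k distinct, randomly chosen data points.
    swarm = [[coord for p in rng.sample(points, k) for coord in p]
             for _ in range(n_particles)]
    vel = [[0.0] * size for _ in range(n_particles)]

    # Personal bests and the global best.
    pbest = [x[:] for x in swarm]
    pbest_fit = [sse(decode(x), points) for x in swarm]
    g = min(range(n_particles), key=pbest_fit.__getitem__)
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]

    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(size):
                r1, r2 = rng.random(), rng.random()
                # Standard velocity update: inertia + cognitive + social terms.
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - swarm[i][d])
                             + c2 * r2 * (gbest[d] - swarm[i][d]))
                swarm[i][d] += vel[i][d]
            fit = sse(decode(swarm[i]), points)
            if fit < pbest_fit[i]:
                pbest[i], pbest_fit[i] = swarm[i][:], fit
                if fit < gbest_fit:
                    gbest, gbest_fit = swarm[i][:], fit

    return decode(gbest), gbest_fit


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (1, 1),
            (10, 10), (10, 11), (11, 10), (11, 11)]
    centroids, fitness = pso_centroids(data, k=2)
    print(centroids, fitness)
```

Encoding all k centroids in one particle keeps the fitness evaluation simple, at the cost of a search space whose dimension grows with both k and the number of features.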
Results: The performance of the proposed method is evaluated on 13 synthetic datasets and compared to X-means using two evaluation metrics: the Rand index and normalized mutual information (NMI). The results demonstrate the superiority of the proposed method over X-means on all of the synthetic datasets. Furthermore, a biological microarray dataset is used to evaluate the proposed method further. Finally, 2 real big mobility datasets, consisting of the trajectories traveled by several cars in the city of Pisa, are analyzed using the proposed clustering method. The first dataset includes the trajectories recorded on Sundays and the second contains the trajectories recorded on Mondays over 5 weeks. The results show that people choose more diverse destinations on Sundays, although the Sunday dataset has fewer trajectories.
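For readers unfamiliar with the two evaluation metrics, the following is a minimal sketch of how the Rand index and NMI can be computed from a ground-truth labeling and a predicted clustering. The function names are illustrative, and the NMI normalization used here (dividing by the geometric mean of the two entropies) is an assumption, since the abstract does not state which variant was used.

```python
from collections import Counter
from itertools import combinations
from math import log, sqrt


def rand_index(labels_true, labels_pred):
    """Fraction of point pairs on which the two labelings agree.

    A pair counts as an agreement if both labelings put the two points
    in the same cluster, or both put them in different clusters.
    """
    n = len(labels_true)
    if n < 2:
        raise ValueError("need at least two points")
    pairs = list(combinations(range(n), 2))
    agree = sum(
        (labels_true[i] == labels_true[j]) == (labels_pred[i] == labels_pred[j])
        for i, j in pairs
    )
    return agree / len(pairs)


def nmi(labels_true, labels_pred):
    """Mutual information divided by sqrt(H(true) * H(pred)) (assumed variant)."""
    n = len(labels_true)
    count_t = Counter(labels_true)
    count_p = Counter(labels_pred)
    joint = Counter(zip(labels_true, labels_pred))

    # Mutual information between the two labelings.
    mi = sum(
        (c / n) * log(c * n / (count_t[a] * count_p[b]))
        for (a, b), c in joint.items()
    )
    # Entropy of each labeling.
    h_t = -sum((c / n) * log(c / n) for c in count_t.values())
    h_p = -sum((c / n) * log(c / n) for c in count_p.values())

    if h_t == 0 or h_p == 0:
        # Degenerate case: at least one labeling has a single cluster.
        return 1.0 if h_t == h_p else 0.0
    return mi / sqrt(h_t * h_p)


if __name__ == "__main__":
    truth = [0, 0, 1, 1]
    print(rand_index(truth, [1, 1, 0, 0]), nmi(truth, [1, 1, 0, 0]))
    print(rand_index(truth, [0, 1, 0, 1]), nmi(truth, [0, 1, 0, 1]))
```

Both metrics are invariant to how the cluster labels are named, so a clustering that recovers the true groups under different label values still scores 1.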
Conclusion: Finding the number of clusters is a major challenge, especially for big datasets. The results show that the proposed method performs well at detecting the number of clusters in high-dimensional and massive datasets. The results also demonstrate the power and effectiveness of swarm intelligence methods in solving hard and complex optimization problems.
======================================================================================================
Copyrights
©2018 The author(s). This is an open access article distributed under the terms of the Creative Commons Attribution (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, as long as the original authors and source are cited. No permission is required from the authors or the publishers.
======================================================================================================