Document Type: Original Research Paper

Authors

1 Department of Electrical Engineering, PhD student, University of Birjand, i.behravan@birjand.ac.ir

2 Department of Electrical Engineering, Faculty of Engineering, University of Birjand, hzahiri@birjand.ac.ir

3 Department of Electrical Engineering, Faculty of Engineering, University of Birjand,

4 KDD lab, ISTI-CNR, Pisa, Italy, roberto.trasarti@isti.cnr.it

Abstract

Background and Objectives: Big data referred to huge datasets with high number of objects and high number of dimensions. Mining and extracting big datasets is beyond the capability of conventional data mining algorithms including clustering algorithms, classification algorithms, feature selection methods and etc.
 Methods: Clustering, which is the process of dividing the data points of a dataset into different groups (clusters) based on their similarities and dissimilarities, is an unsupervised learning method which discovers useful information and hidden patterns from raw data. In this research a new clustering method for big datasets is introduced based on Particle Swarm Optimization (PSO) algorithm. The proposed method is a two-stage algorithm which first searches the solution space for proper number of clusters and then searches to find the position of the centroids.
Results: the performance of the proposed method is evaluated on 13 synthetic datasets. Also its performance is compared to X-means through calculating two evaluation metrics: Rand index and NMI index. The results demonstrate the superiority of the proposed method over X-means for all of the synthetic datasets.  Furthermore, a biological microarray dataset is used to evaluate the proposed method deeper. Finally, 2 real big mobility datasets, including the trajectories traveled by several cars in the city of Pisa, are analyzed using the proposed clustering method. The first dataset includes the trajectories recorded in Sunday and the second one contains the trajectories recorded in Monday during 5 weeks. The achieved results showed that people choose more diverse destinations in Sunday although it has fewer trajectories.
Conclusion: Finding the number of clusters is a big challenge especially fir big datasets. The results achieved for the proposed method showed its fabulous performance in detecting the number of clusters for high dimensional and massive datasets. Also, the results demonstrate the power and effectiveness of the swarm intelligence methods in solving hard and complex optimization problems.

Keywords

Main Subjects

[1] J. V. Aggarwal, V. Bhatnagar, D. K. Mishra, Big Data Analytics, Springer Singapore, 2018,

[2] J. Manyika, M. Chui, B. Brown, J. Bughin, R. Dobbs, C. Roxburgh, A. Hung Byers, “Big data: The next frontier for innovation, competition, and productivity,” Tech. Rep., 2011,

[3] S. Cheng, Y. Shi, Q. Qin, R. Bai, “Swarm intelligence in big data analytics,” in Proc. International Conference on Intelligent Data Engineering and Automated Learning: 417-426, 2013.

[4] A. Rajaraman, J. D. Ullman, Mining of massive datasets: Cambridge University Press, 2011.

[5] S. Cheng, Q. Zhang, Q. Qin, “Big data analytics with swarm intelligence,” Industrial Management & Data Systems, 116(4): 646-666, 2016.

[6] A. K. Jain, M. N. Murty, P. J. Flynn, “Data clustering: A review,” ACM Computing Surveys (CSUR), 31(3): 264-323, 1999.

[7] J. A. Hartigan, “Clustering algorithms,” John Wiley & Sons, 1975,

[8] R. Xu, D. Wunsch, “Survey of clustering algorithms,” IEEE Transactions on Neural Networks, 16(1): 645-678, 2005.

[9] A. K. Jain, “Data clustering: 50 years beyond K-means,” Pattern recognition letters, 31(8): 651-666, 2010.

[10] J. A. Hartigan, M. A. Wong, “Algorithm AS 136: A k-means clustering algorithm,” Journal of the Royal Statistical Society. Series C (Applied Statistics), 28(1): 100-108, 1979.

[11] J. C. Bezdek, R. Ehrlich, W. Full, “FCM: The fuzzy c-means clustering algorithm,” Computers & Geosciences, 10(2-3): 191-203, 1984.

[12] A. S. Shirkhorshidi, S. Aghabozorgi, T. Y. Wah, T. Herawan, “Big data clustering: a review,” in Proc. International Conference on Computational Science and Its Applications: 707-720, 2014.

[13] S. Cheng, B. Liu, T. Ting, Q. Qin, Y. Shi, K. Huang, “Survey on data science with population-based algorithms,” Big Data Analytics, 1(1): 3, 2016.

[14] Y. Shi, “An optimization algorithm based on brainstorming process,” in Emerging Research on Swarm Intelligence and Algorithm Optimization: 1-35, 2015.

[15] N. Kokash, “An introduction to heuristic algorithms,” Department of Informatics and Telecommunications: 1-8, 2005,

[16] X. Yu, M. Gen, Introduction to evolutionary algorithms: Springer Science & Business Media, Springer-Verlag, London 2010.

[17] E. Bonabeau, M. Dorigo, G. Theraulaz, Swarm Intelligence: From Natural to Artificial Systems: 1, Oxford university press, 1999,

[18] J. Kennedy, “Particle swarm optimization,” in Encyclopedia of machine learning, ed: Springer: 760-766, 2011.

[19] M. H. Mozaffari, H. Abdy, S. H. Zahiri, “IPO: An inclined planes system optimization algorithm,” Computing and Informatics, 35(1): 222-240, 2016.

[20] E. Rashedi, H. Nezamabadi-Pour, S. Saryazdi, “GSA: A gravitational search algorithm,” Information Sciences, 179(13): 2232-2248, 2009.

[21] M. Dorigo, M. Birattari, “Ant colony optimization,” in Encyclopedia of machine learning, ed: Springer: 36-39, 2011.

[22] X. Cui, T. E. Potok, P. Palathingal, “Document clustering using particle swarm optimization,” in Proc. IEEE Swarm Intelligence Symposium: 185-191, 2005.

[23] M. G. Omran, A. Salman, A. P. Engelbrecht, “Dynamic clustering using particle swarm optimization with application in image segmentation,” Pattern Analysis and Applications, 8(4): 332, 2006.

[24] A. Abraham, S. Das, S. Roy, “Swarm intelligence algorithms for data clustering,” in Soft Computing for Knowledge Discovery and Data Mining, O. Maimon and L. Rokach, Eds., ed Boston, MA: Springer US: 279-313, 2008,

[25] A. Ahmadyfard, H. Modares, “Combining PSO and k-means to enhance data clustering,” in Proc. IEEE International Symposium on Telecommunications: 688-691, 2008.

[26] C. Zhang, D. Ouyang, J. Ning, “An artificial bee colony approach for clustering,” Expert Systems with Applications, 37(7): 4761-4767, 2010.

[27] D. Karaboga, B. Basturk, “On the performance of artificial bee colony (ABC) algorithm,” Applied Soft Computing, 8(1): 687-697, 2008.

[28] G. Krishnasamy, A. J. Kulkarni, R. Paramesran, “A hybrid approach for data clustering based on modified cohort intelligence and K-means,” Expert Systems with Applications, 41(13): 6009-6016, 2014.

[29] A. J. Kulkarni, I. P. Durugkar, M. Kumar, “Cohort intelligence: a self-supervised learning behavior,” in Proc. IEEE International Conference on Systems, Man, and Cybernetics: 1396-1400, 2013.

[30] S. H. Razavi, E. O. M. Ebadati, S. Asadi, H. Kaur, “An efficient grouping genetic algorithm for data clustering and big data analysis,” in Computational Intelligence for Big Data Analysis, ed: Springer: 119-142, 2015.

[31] Y. Lu, B. Cao, C. Rego, and F. Glover, “a tabu search based clustering algorithm and its parallel implementation on spark,” Applied Soft Computing, 63(63): 97-109, 2018.

[32] M. Fahad, F. Aadil, Z. u. Rehman, S. Khan, P. A. Shah, K. Muhammad, et al., “Grey wolf optimization based clustering algorithm for vehicular ad-hoc networks,” Computers & Electrical Engineering, 70(1): 853-870, 2018.

[33] A. Starczewski, A. Krzyżak, “Performance evaluation of the silhouette index," in Proc. International Conference on Artificial Intelligence and Soft Computing: 49-58, 2015.

[34] D. L. Davies, D. W. Bouldin, “A cluster separation measure,” IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-1(1): 224-227, 1979.

[35] M. K. Pakhira, S. Bandyopadhyay, U. Maulik, "Validity index for crisp and fuzzy clusters,” Pattern Recognition, 37(3): 487-501, 2004.

[36] T. Caliński, J. Harabasz, “A dendrite method for cluster analysis,” Communications in Statistics-theory and Methods, 3(1): 1-27, 1974.

[37] D. Pelleg, A. W. Moore, “X-means: Extending k-means with efficient estimation of the number of clusters,” in Proc. 17th International Conf. on Machine Learning Citations (ICML), 1(1): 727-734, 2000.

[38] School of Computing University of Eastern Finland. (2015, April 4, 2018). Clustering basic benchmark.

[39] W. M. Rand, "Objective criteria for the evaluation of clustering methods," Journal of the American Statistical association, 66(336): 846-850, 1971.

[40] A. F. McDaid, D. Greene, N. Hurley, "Normalized mutual information to evaluate overlapping community finding algorithms," arXiv preprint arXiv:1110.2515, 2011.

[41] H. Aidos, R. P. Duin, A. L. Fred, “The area under the ROC curve as a criterion for clustering evaluation," in Proc. 2nd International Conference on Pattern Recognition Applications and Methods (ICPRAM): 276-280, 2013.

[42] P. Fränti, O. Virmajoki, “Iterative shrinking method for clustering problems,” Pattern Recognition, 39(5): 761-775, 2006.

[43] I. Kärkkäinen, P. Fränti, Dynamic local search algorithm for the clustering problem: University of Joensuu, 2002.

[44] P. Fränti, R. Mariescu-Istodor, C. Zhong, “XNN graph,” in Proc. Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR): 207-217, 2016.

[45] P. Franti, O. Virmajoki, V. Hautamaki, “Fast agglomerative clustering using a k-nearest neighbor graph,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(11): 1875-1881, 2006.

[46] B. D. Lehmann, J. A. Bauer, X. Chen, M. E. Sanders, A. B. Chakravarthy, Y. Shyr, et al., “Identification of human triple-negative breast cancer subtypes and preclinical models for selection of targeted therapies,” The Journal of Clinical Investigation, 121: 2750-2767, 2011.