Document Type: Original Research Paper


1 PhD Student, Department of Electrical Engineering, University of Birjand

2 Department of Electrical Engineering, Faculty of Engineering, University of Birjand

3 Department of Electrical Engineering, Faculty of Engineering, University of Birjand

4 KDD Lab, ISTI-CNR, Pisa, Italy


Big data refers to huge datasets with a large number of objects and a large number of dimensions. Mining and analyzing such datasets is beyond the capability of conventional data mining algorithms, including clustering algorithms, classification algorithms, and feature selection methods. Clustering, the process of dividing the data points of a dataset into groups (clusters) based on their similarities and dissimilarities, is an unsupervised learning method that discovers useful information and hidden patterns in raw data. K-means is an efficient clustering algorithm, but it suffers from several drawbacks: it tends to converge to a local optimum, its result depends on the initial positions of the cluster centers, and it cannot determine the number of clusters on its own. In this research, a new clustering method for big datasets is introduced based on the Particle Swarm Optimization (PSO) algorithm. PSO is a heuristic algorithm with a strong ability to search the solution space and find the global optimum. The proposed method is a two-stage algorithm that first searches the solution space for the proper number of clusters and then searches for the positions of the centroids. Its performance is evaluated on 13 synthetic datasets and a biological microarray dataset. Finally, two real big mobility datasets are investigated and analyzed using the proposed clustering method.
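The abstract does not give the exact formulation of the two-stage algorithm, but the second stage (searching for centroid positions with PSO) can be illustrated with a minimal sketch. In this sketch, every detail is an assumption for illustration: each particle encodes all k candidate centroids as one flattened position vector, the fitness is the total within-cluster sum of squared errors (SSE), and standard PSO update rules with inertia weight w and acceleration coefficients c1, c2 are used. The function name `pso_cluster` and all parameter defaults are hypothetical, not taken from the paper.

```python
import numpy as np

def pso_cluster(X, k, n_particles=20, iters=100, w=0.72, c1=1.49, c2=1.49, seed=0):
    """Search for k centroids with PSO, minimizing total within-cluster SSE.

    A hypothetical sketch of PSO-based centroid search, not the paper's algorithm.
    """
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    # Each particle encodes k candidate centroids inside the data bounding box.
    pos = rng.uniform(lo, hi, size=(n_particles, k, X.shape[1]))
    vel = np.zeros_like(pos)

    def sse(centroids):
        # Assign each point to its nearest centroid; sum the squared distances.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        return dists.min(axis=1).sum()

    # Personal bests and the global best across the swarm.
    pbest = pos.copy()
    pbest_f = np.array([sse(p) for p in pos])
    g = pbest_f.argmin()
    gbest, gbest_f = pbest[g].copy(), pbest_f[g]

    for _ in range(iters):
        r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
        # Standard PSO velocity update: inertia + cognitive + social terms.
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = np.clip(pos + vel, lo, hi)
        f = np.array([sse(p) for p in pos])
        improved = f < pbest_f
        pbest[improved], pbest_f[improved] = pos[improved], f[improved]
        g = pbest_f.argmin()
        if pbest_f[g] < gbest_f:
            gbest, gbest_f = pbest[g].copy(), pbest_f[g]

    labels = ((X[:, None, :] - gbest[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
    return gbest, labels
```

The first stage of the proposed method (searching for the number of clusters) would, under the same assumptions, wrap such a search in an outer loop over candidate k values scored by a cluster validity index rather than raw SSE, since SSE alone always decreases with k.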


