Evaluation of Optimal Number of Topics of Topic Model: An Approach Based on the Quality of Clusters

Fedor Krasnov

Abstract


Although topic models have been used to cluster documents for more than ten years, choosing the optimal number of topics remains an open problem. The authors reviewed many fundamental studies on this subject from recent years. The main difficulty is the lack of a stable metric for the quality of the topics that a topic model produces. The authors examined the internal metrics of the topic model (Coherence, Contrast, and Purity) as a means of determining the optimal number of topics and concluded that they are not suitable for this purpose. The authors then examined an approach to choosing the optimal number of topics based on the quality of the resulting clusters. For this purpose, they considered the behavior of three cluster validation metrics: the Davies-Bouldin Index, the Silhouette Coefficient, and the Calinski-Harabasz Index.
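As a rough illustration of how two of these cluster validation metrics react to the number of clusters, the sketch below computes the Davies-Bouldin Index and the Calinski-Harabasz Index for two candidate partitions of the same points. The toy 2-D points, the hand-written labels, and all helper names are assumptions made for this example; they are not taken from the paper, which works with topic-model output rather than hand-labeled points.

```python
from collections import defaultdict
from math import dist


def centroid(points):
    """Coordinate-wise mean of a list of equal-length tuples."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))


def group_by_label(points, labels):
    """Collect points into one list per cluster label."""
    clusters = defaultdict(list)
    for point, label in zip(points, labels):
        clusters[label].append(point)
    return list(clusters.values())


def davies_bouldin(points, labels):
    """Davies-Bouldin Index (Euclidean); lower means better separation."""
    clusters = group_by_label(points, labels)
    centers = [centroid(c) for c in clusters]
    # Mean distance of each cluster's points to its own centroid.
    scatter = [sum(dist(p, ctr) for p in c) / len(c)
               for c, ctr in zip(clusters, centers)]
    k = len(clusters)
    # For each cluster, keep the worst (largest) similarity to any other cluster.
    worst = [max((scatter[i] + scatter[j]) / dist(centers[i], centers[j])
                 for j in range(k) if j != i)
             for i in range(k)]
    return sum(worst) / k


def calinski_harabasz(points, labels):
    """Calinski-Harabasz Index; higher means denser, better separated clusters."""
    clusters = group_by_label(points, labels)
    n, k = len(points), len(clusters)
    overall = centroid(points)
    between = sum(len(c) * dist(centroid(c), overall) ** 2 for c in clusters)
    within = sum(dist(p, centroid(c)) ** 2 for c in clusters for p in c)
    return (between / (k - 1)) / (within / (n - k))


# Hypothetical toy data: three tight groups of three points each.
points = [(0, 0), (0, 1), (1, 0),
          (10, 0), (10, 1), (11, 0),
          (0, 10), (0, 11), (1, 10)]
labels_3 = [0, 0, 0, 1, 1, 1, 2, 2, 2]   # one cluster per group
labels_2 = [0, 0, 0, 1, 1, 1, 1, 1, 1]   # last two groups merged

for name, labels in (("3 clusters", labels_3), ("2 clusters", labels_2)):
    print(name,
          "DB:", round(davies_bouldin(points, labels), 3),
          "CH:", round(calinski_harabasz(points, labels), 1))
```

On this toy data both metrics favor the three-cluster partition: it has the lower Davies-Bouldin Index and the higher Calinski-Harabasz Index. Scanning such scores over a range of candidate topic counts is the general idea behind a cluster-quality approach to choosing the number of topics.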

The proposed method of determining the optimal number of topics rests on the following principles: fitting a topic model with additive regularization (ARTM) to separate out noise topics; using dense vector representations (GloVe, FastText, Word2Vec); and using cosine distance in the cluster metrics, which works better than Euclidean distance on high-dimensional vectors.
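The choice of distance can be sketched with a small Silhouette Coefficient computation that accepts any distance function. The vectors below are hypothetical 4-dimensional stand-ins for dense document representations, and the function names are assumptions for this example only. The sketch illustrates just one property, that cosine distance ignores vector length; it does not reproduce the paper's experiment or demonstrate the high-dimensional behavior itself.

```python
from math import dist, sqrt
from statistics import mean


def cosine_distance(u, v):
    """1 - cosine similarity; depends only on the angle between u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def silhouette(vectors, labels, distance):
    """Mean Silhouette Coefficient over all vectors for a given distance."""
    scores = []
    for i, (v, own) in enumerate(zip(vectors, labels)):
        # Distances from v to every other vector, grouped by cluster label.
        per_cluster = {}
        for j, (w, other) in enumerate(zip(vectors, labels)):
            if i != j:
                per_cluster.setdefault(other, []).append(distance(v, w))
        if own not in per_cluster:
            scores.append(0.0)  # singleton cluster: score 0 by convention
            continue
        a = mean(per_cluster[own])                     # cohesion
        b = min(mean(d) for lbl, d in per_cluster.items()
                if lbl != own)                         # separation
        scores.append(0.0 if max(a, b) == 0 else (b - a) / max(a, b))
    return mean(scores)


# Hypothetical vectors: two themes, with documents of very different "length".
vectors = [(1.0, 0.1, 0.0, 0.0), (5.0, 0.4, 0.0, 0.0), (0.2, 0.02, 0.0, 0.0),
           (0.0, 0.0, 1.0, 0.1), (0.0, 0.0, 4.0, 0.5), (0.0, 0.0, 0.3, 0.02)]
labels = [0, 0, 0, 1, 1, 1]

cosine_score = silhouette(vectors, labels, cosine_distance)
euclidean_score = silhouette(vectors, labels, dist)
print("cosine:", round(cosine_score, 3), "euclidean:", round(euclidean_score, 3))
```

Here the two themes point in orthogonal directions, so the cosine-based score is close to 1, while the Euclidean score is pulled down because vectors of different magnitude within one theme lie far apart.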

The methodology developed by the authors for obtaining the optimal number of topics was tested on a collection of scientific articles from the OnePetro library, selected by specific themes. The experiment showed that the proposed method makes it possible to estimate the optimal number of topics for a topic model built on a small collection of English-language documents.

Full Text:

PDF (Russian)



ISSN: 2307-8162