Comparative Analysis of the Accuracy of Methods for Visualizing the Structure of a Text Collection
Abstract
Visualization of multidimensional data is one of the most important stages of data exploration. Decisions about the further stages of a study are often made from a flat view of the data, based on "rough proportions". Representing multidimensional vectors on the plane while preserving distances is both clear and persuasive, and it has been used successfully in models of distributional semantics (Word2Vec, GloVe, NaVec). On the other hand, an inaccurate two-dimensional projection can lead to time being wasted searching for multidimensional structures that do not exist. The author set out to evaluate the accuracy of dimensionality reduction methods under two constraints: the high dimensionality arises from the vector representation of text documents, and the reduction is aimed at visualization on the plane. Among the numerous dimensionality reduction methods, there is no separate class of approaches designed specifically for visualization. To measure accuracy, the author chose an approach that uses labeled data and quantifies how well the labeling is preserved when the dimensionality is reduced. The author investigated 12 dimensionality reduction methods on two labeled data sets in Russian and English. According to the Silhouette Coefficient metric, the most accurate visualization method for text data was UMAP with the Hellinger distance as the metric.
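The evaluation idea described in the abstract (score a two-dimensional projection by how well the known labels stay separated) can be illustrated with a minimal sketch. The code below is not the author's implementation: it is a plain-Python version of the Silhouette Coefficient as defined by Rousseeuw, it assumes Euclidean distance on the plane, and the function name `silhouette_coefficient` and the toy points and labels are hypothetical. In practice a library routine would probably be used instead.

```python
import math
from collections import defaultdict


def silhouette_coefficient(points, labels):
    """Mean silhouette over all points, using Euclidean distance.

    points: list of (x, y) tuples, e.g. a 2D projection of document vectors.
    labels: list of class labels of the same length (the markup).
    """
    if len(points) != len(labels):
        raise ValueError("points and labels must have the same length")

    # Group point indices by label.
    groups = defaultdict(list)
    for idx, label in enumerate(labels):
        groups[label].append(idx)
    if len(groups) < 2:
        raise ValueError("at least two distinct labels are required")

    scores = []
    for i, label in enumerate(labels):
        own = groups[label]
        if len(own) == 1:
            # Convention: a singleton class contributes a score of 0.
            scores.append(0.0)
            continue
        # a: mean distance to the other points with the same label.
        a = sum(math.dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the points of any other label.
        b = min(
            sum(math.dist(points[i], points[j]) for j in other) / len(other)
            for other_label, other in groups.items()
            if other_label != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Hypothetical 2D projections of six labeled documents (assumed data).
labels = ["sport", "sport", "sport", "tech", "tech", "tech"]
well_separated = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
mixed = [(0, 0), (10, 10), (1, 0), (0, 1), (10, 11), (11, 10)]

good = silhouette_coefficient(well_separated, labels)
bad = silhouette_coefficient(mixed, labels)
print(f"well separated: {good:.3f}, mixed: {bad:.3f}")
```

A projection that keeps documents of the same class together scores close to 1, while one that mixes the classes scores near or below 0, which is what makes the coefficient usable for ranking dimensionality reduction methods on labeled data.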
ISSN: 2307-8162