On the Role of Prepositional Statistics for Genre Identification of Russian Texts

O. A. Mitrofanova, A. D. Moskvina


In this work we investigate the role of statistical data on function words in the automatic identification of genre and topical characteristics of Russian texts. We use the frequency ratio of semantically related prepositions as the principal linguistic parameter. We consider seven frequent prepositions that have a spatial meaning and also one or more figurative meanings: под (under) / над (over), в (in) / из (from), к (to) / от (from), за (behind) / перед (in front of), в (in) / на (at), на (at) / с (from). Our hypothesis is that the frequency ratios of the prepositions in these pairs may indicate stylistic properties of texts. We based our research on several corpora representing different genres and topics: the general, literary, publicistic, non-literary and oral subcorpora of the Russian National Corpus (RNC); Russian corpora from the Aranea family of very large web corpora, namely Araneum Russicum Russicum and Araneum Russicum Externum; a social media corpus of posts and comments from Facebook and Twitter; and a literary corpus of texts from the Librusec digital library. Using statistical analysis of polysemous prepositions, we tested the hypothesis that the oral and written speech of social media users is stylistically homogeneous. The experiments confirmed the significance of the под (under) / над (over) coefficient for style and text type detection, and revealed the role of в (in) / из (from) and за (behind) / перед (in front of) in differentiating written and oral texts. We obtained data on preposition occurrence statistics and on the semantic content of prepositional phrases, both of which are significant for text style, genre and topic detection. We also identified and analyzed the main properties of the use of polysemous prepositions.
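The ratio coefficients described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `preposition_ratios`, the regex tokenizer, and the choice to return `None` for an undefined ratio are all assumptions made for illustration. The sketch also ignores phonetic variants (во, изо, ко, со, подо, передо) and does no disambiguation of homonyms, both of which a real study would likely need to handle.

```python
import re
from collections import Counter

# Preposition pairs listed in the abstract, as (numerator, denominator).
PAIRS = [
    ("под", "над"),
    ("в", "из"),
    ("к", "от"),
    ("за", "перед"),
    ("в", "на"),
    ("на", "с"),
]

# Assumption: a token is a run of lowercase Cyrillic letters.
TOKEN_RE = re.compile(r"[а-яё]+")


def preposition_ratios(text):
    """Return the frequency ratio for each preposition pair in `text`.

    A ratio is None when the denominator preposition does not occur.
    """
    counts = Counter(TOKEN_RE.findall(text.lower()))
    ratios = {}
    for first, second in PAIRS:
        if counts[second]:
            ratios[f"{first}/{second}"] = counts[first] / counts[second]
        else:
            ratios[f"{first}/{second}"] = None
    return ratios


if __name__ == "__main__":
    sample = "Под столом и под стулом, но над домом."
    for pair, value in preposition_ratios(sample).items():
        print(pair, value)
```

Computing the same coefficients over each corpus or subcorpus would then allow the values to be compared across genres, which is the kind of comparison the abstract describes.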






ISSN: 2307-8162