Computer analysis of texts
N.L. Avanesyan, O.V. Gubina, A.M. Chepovskiy. Corpus analysis methods for study of texts of prose literary works by various authors
Abstract. 

This article applies mathematical methods of corpus analysis to the study of Russian fiction. A corpus of 19th-century Russian prose fiction, consisting of five subcorpora, was created for the research; each subcorpus contains the texts of a single author. Using this corpus as an example, the article demonstrates how correspondence analysis, as integrated into the TXM platform, can serve as one tool of statistical research. As a second method, it considers the analysis of pairwise rank correlation coefficients to compare the frequency characteristics of texts from different subcorpora. The two methods give correlated results and make it possible to identify differentiating features. The described approach can be used both in linguistic and literary studies and in creating appropriate training text sets for artificial intelligence tasks.
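The pairwise comparison of frequency characteristics can be illustrated with a minimal sketch. The abstract does not say which rank correlation coefficient is used, so Spearman's coefficient is assumed here; the function names and the toy token lists are hypothetical and stand in for real subcorpora. This is not the authors' implementation.

```python
from collections import Counter
from itertools import combinations


def average_ranks(values):
    """Rank values so the largest gets rank 1; tied values share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def pairwise_rank_correlations(subcorpora):
    """Map each pair of subcorpus names to the rho of their word-frequency profiles.

    `subcorpora` maps a name to a list of tokens. Frequencies are counted over
    the shared vocabulary; raw counts suffice because ranks ignore scaling.
    """
    freqs = {name: Counter(tokens) for name, tokens in subcorpora.items()}
    vocab = sorted(set().union(*freqs.values()))
    profiles = {name: [c[w] for w in vocab] for name, c in freqs.items()}
    return {
        (a, b): spearman(profiles[a], profiles[b])
        for a, b in combinations(sorted(profiles), 2)
    }


if __name__ == "__main__":
    # Toy placeholder data, not taken from the article's corpus.
    toy = {
        "author_a": "the river ran and the river froze".split(),
        "author_b": "the river ran and the wind rose".split(),
        "author_c": "wind and snow and wind and ice".split(),
    }
    for pair, rho in pairwise_rank_correlations(toy).items():
        print(pair, round(rho, 3))
```

Under these assumptions, a coefficient close to 1 means two subcorpora rank their vocabulary similarly, while lower values point to words whose ranks differ and which may serve as differentiating features.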

Keywords: 

corpus linguistics, TXM platform, correspondence analysis, correlation analysis.

DOI: 10.14357/20790279240204 

EDN: IKHGUO

PP. 25-32.
 
 
