Информационные технологии
Интеллектуальный анализ данных
N.D. Moskin, K.A. Kulakov, A.A. Rogov, R.V. Abramov "Research the Stability of Decision Trees Using Distances on Graphs"
Методы и модели в естественных науках
Компьютерный анализ текстов
N.D. Moskin, K.A. Kulakov, A.A. Rogov, R.V. Abramov "Research the Stability of Decision Trees Using Distances on Graphs"
Abstract. 

The article deals with the problem of stability of classifiers based on decision trees for the problem of text attribution. Such a task arises, for example, in the study of the authorship of articles from the pre-revolutionary journals “Time” (1861–1863), “Epoch” (1864–1865) and the weekly “Citizen” (1873–1874). The texts were divided into separate parts of different sizes using the sliding window method, then the frequency of n-grams (encoded sequences of parts of speech) in each fragment was determined. Further, these indicators were used to build various classifiers. The resulting decision trees were compared with each other using the tree edit distance. For this purpose, a procedure for processing, comparing and visualizing graphs was implemented in the SMALT software package. As a result of experiments using different weights for editing operations, patterns were revealed between the parameters for constructing text fragments and the decision trees obtained on their basis.

Keywords: 

text attribution, n-gram, decision tree, graph matching, tree edit distance, software complex “SMALT”.

Стр. 94-100.

DOI: 10.14357/20790279230111
 
 
References

1. Abramov R. V., Kulakov K. A., Lebedev A. A., Moskin N. D., Rogov A. A. 2021. Research of features of Dostoevsky’s publicistic style by using n-grams based on the materials of the “Time” and “Epoch” magazines. Vestnik of Saint Petersburg University. Applied Mathematics. Computer Science. Control Processes. Saint Petersburg, 17(4): 389–396.
2. Safavian S. R., Landgrebe D. 1991. A survey of decision tree classifier methodology. IEEE transactions on systems, man, and cybernetics, 21(3): 660–674.
3. Pedregosa F., et al. 2011. Scikit-learn: Machine learning in python. The Journal of Machine Learning Research, 12: 2825–2830.
4. Lewis R. J. 2000. An introduction to classification and regression tree (CART) analysis. Annual meeting of the society for academic emergency medicine in San Francisco, California. Citeseer 14.
5. Coppersmith D., Hong S. J., Hosking J. R. M. 1999. Partitioning nominal attributes in decision trees. Data Mining and Knowledge Discovery, 3(2): 197–217.
6. Conte D., Foggia P., Sansone C., Vento M. 2004. Thirty years of graph matching in pattern recognition. International Journal of Pattern Recognition and Artificial Intelligence, 18(3): 265–298.
7. Jiang X., Bunke H. 2008. Graph matching. Case- Based Reasoning on Images and Signals. Vol. 73 of Studies in Computational Intelligence. Springer, 149–173.
8. Riesen K. 2015. Structural Pattern Recognition with Graph Edit Distance: Approximation Algorithms and Applications. Advances in Computer Vision and Pattern Recognition. Springer, Heidelberg. 158 p.
9. Bille P. 2003. Tree edit distance, alignment distance and inclusion. IT University of Copenhagen. Technical Report Series, TR-2003-23.
10. Jiang T., Wang L., Zhang K. 1995. Alignment of trees – an alternative to tree edit. Theoretical Computer Science, 143(1): 137–148.
11. Torsello A., Hidovic D., Pelillo M. 2004. Four metrics for efficiently comparing attributed trees. Proc. ICPR’04 – 17th International Conference on Pattern Recognition, 2: 467–470.
12. Torsello A., Hidovic D., Pelillo M. 2005. Polynomial time metrics for attributed trees. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(7): 1087–1099.
13. Kuznetsov A. V. 2017. Mera neskhodstva na mnozhestve grafov i ee prilozheniya [A measure of dissimilarity on a set of graphs and its applications]. Vestnik Voronezhskogo gosudarstvennogo universiteta. Seriya: sistemnyj analiz i informacionnye tekhnologii [Bulletin of the Voronezh State University. Series: system analysis and information technology], 1: 125–131.
14. Isert C. 1999. The editing distance between trees. Ferienakademie Bäume: Algorithmik und Kombinatorik. Sarntal, Italy.
15. Rogov A. A., Abramov R. V., Buchneva D. D., Zakharova O. V., Kulakov K. A., Lebedev A. A., Moskin N. D., Otlivanchik A. V., Savinov E. D., Sidorov Y. V. 2021. Problema atribucii v zhurnalah “Vremya”, “Epoha” i ezhenedel’nike “Grazhdanin” [The problem of attribution in the magazines “Time”, “Epoch” and the weekly “Citizen”]. Petrozavodsk: Islands, 391 p.
16. Wu L., Chen Y., Shen K., Guo X., Gao H., Li S., Pei J., Long B. 2021. Graph Neural Networks for Natural Language Processing: A Survey. ArXiv abs/2106.06090.
17. Hlaoui A., Wang S. 2003. A New Median Graph Algorithm. Graph Based Representations in Pattern Recognition (GbRPR 2003). Lecture Notes in Computer Science. Springer, Berlin, Heidelberg, 2726: 225–234.
 
2024-74-1
2023-73-4
2023-73-3
2023-73-2

© ФИЦ ИУ РАН 2008-2018. Создание сайта "РосИнтернет технологии".