Journal «Proceedings of the Institute for Systems Analysis of the Russian Academy of Sciences» - A.P. Zavyalova, P.A. Martynyuk, R.S. Samarev "Sentence splitters benchmark"

Issue 2023-73-1

Many text-to-sentence splitters exist, including open-source libraries and tools, but they differ considerably in segmentation quality and in performance. Moreover, NLP developers find it convenient to have all libraries written in the same programming language, unless a dedicated integration language is used. This paper considers two aspects. The first is building a uniform framework and, in doing so, evaluating the language features of Julia, a modern and popular programming language. The second is estimating the performance of sentence-splitting libraries as is. The paper contains detailed performance results, samples of texts after splitting, and a list of typical issues related to sentence splitting.

Keywords: segmentation, sentence, splitting, NLP, Julia language, benchmark, text analysis.

DOI: 10.14357/20790279230119
