Abstract.
The paper considers approaches to handling unknown words in the language models used by natural language processing algorithms. A method is proposed for accounting for unknown words in probabilistic topic modeling that makes it possible to estimate the probability that a document is novel with respect to the existing topics. Topic models produce a probabilistic estimate of a word's assignment to a topic: the word-topic relationship matrix of such a model is filled with posterior word probabilities. To obtain a probabilistic estimate of a document's novelty, this paper proposes introducing into the model a penalty for unknownness, i.e., an a priori probability estimate for unknown words. A software prototype has been developed that computes the probability of a document's novelty for various values of the penalty. Experiments were conducted on the SCTM-ru text corpus; they demonstrate how the method classifies collections and streams of text documents containing unknown words and how such words influence the documents' topics. The experiments also compare the classification results of the topic model with those of a classifier based on logistic regression.
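The exact formulas of the method are given in the body of the paper rather than in the abstract, but the idea can be illustrated with a short sketch. The code below is a hypothetical reading, not the paper's implementation: the function name, its arguments (phi, theta_d, penalty) and the way the novelty score is normalized are assumptions; it only shows how a fixed a priori probability for out-of-vocabulary words can be folded into a document-level novelty estimate.

```python
import numpy as np

def novelty_probability(tokens, vocab, phi, theta_d, penalty=1e-6):
    """Sketch (not the paper's exact method): score a document's novelty by
    the share of its token probability mass that falls outside the topic
    model's vocabulary.

    tokens  -- list of word tokens of the document
    vocab   -- dict mapping known words to row indices of phi
    phi     -- |V| x |T| matrix of p(w | t) estimates from the topic model
    theta_d -- length-|T| vector of p(t | d) for this document
    penalty -- a priori probability assigned to each unknown word
    """
    known_mass, unknown_mass = 0.0, 0.0
    for w in tokens:
        if w in vocab:
            # probability of a known word under the document's topic mixture
            known_mass += float(phi[vocab[w]] @ theta_d)
        else:
            # unknown word: use the fixed a priori "penalty" probability
            unknown_mass += penalty
    total = known_mass + unknown_mass
    return unknown_mass / total if total > 0 else 0.0


# Toy illustration with two known words and two topics.
phi = np.array([[0.7, 0.1],
                [0.3, 0.9]])
vocab = {"known_a": 0, "known_b": 1}
theta_d = np.array([0.5, 0.5])
print(novelty_probability(["known_a", "new_word", "another_new"],
                          vocab, phi, theta_d, penalty=0.01))
```

With a larger penalty, unknown words contribute more mass and the novelty score of documents containing them grows accordingly, which is the effect the experiments in the paper vary.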
Keywords:
topic modeling, natural language processing, penalty for unknown words
Pp. 101-124.
DOI 10.14357/20718632200410

References
1. Krylova M. N. Jazyk kak dinamicheskaja sistema [Language as a dynamic system] // Modeli, sistemy, seti v ehkonomike, tekhnike, prirode i obshhestve [Models, systems, networks in economics, engineering, nature and society]. 2014. No. 1 (9).
2. Wang C., Blei D., Heckerman D. Continuous time dynamic topic models // arXiv preprint arXiv:1206.3298. 2012.
3. Hoffman M., Bach F. R., Blei D. M. Online learning for latent Dirichlet allocation // Advances in Neural Information Processing Systems. 2010. Pp. 856-864.
4. Zhai K., Boyd-Graber J. L. Online latent Dirichlet allocation with infinite vocabulary // ICML (1). 2013. Vol. 28. Pp. 561-569.
5. Lau J. H., Collier N., Baldwin T. On-line trend analysis with topic models: #twitter trends detection topic model online // COLING. 2012. Pp. 1519-1534.
6. Karpovich S. N. Tematicheskaja model s beskonechnym slovarem [Topic model with an infinite vocabulary] // Informazionno-Upravlyaushie Sistemy [Information & Control Systems]. 2016. No. 6. Pp. 43-49. doi:10.15217/issn1684-8853.2016.6.43.
7. Karpovich S. N., Smirnov A. V., Teslja N. N. Odnoklassovaja klassifikacija tekstovykh dokumentov s ispol'zovaniem verojatnostnogo tematicheskogo modelirovanija [Positive example based learning with a topic model] // Iskusstvennyj intellekt i prinjatie reshenij [Artificial Intelligence and Decision Making]. 2018. No. 3. Pp. 69-77.
8. Goldberg Y., Hirst G. Neural Network Methods in Natural Language Processing. Morgan & Claypool Publishers, 2017. ISBN 9781627052986.
9. Berger A., Lafferty J. Information retrieval as statistical translation // ACM SIGIR Forum. New York, NY, USA: ACM, 2017. Vol. 51. No. 2. Pp. 219-226.
10. Wallach H. M. Topic modeling: beyond bag-of-words // Proceedings of the 23rd International Conference on Machine Learning. 2006. Pp. 977-984.
11. Mikolov T. et al. Efficient estimation of word representations in vector space // arXiv preprint arXiv:1301.3781. 2013.
12. Rong X. word2vec parameter learning explained // arXiv preprint arXiv:1411.2738. 2014.
13. Pennington J., Socher R., Manning C. D. GloVe: Global vectors for word representation // Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2014. Pp. 1532-1543.
14. Devlin J. et al. BERT: Pre-training of deep bidirectional transformers for language understanding // arXiv preprint arXiv:1810.04805. 2018.
15. Joulin A. et al. FastText.zip: Compressing text classification models // arXiv preprint arXiv:1612.03651. 2016.
16. Brown T. B. et al. Language models are few-shot learners // arXiv preprint arXiv:2005.14165. 2020.
17. Lau J. H., Baldwin T. An empirical evaluation of doc2vec with practical insights into document embedding generation // arXiv preprint arXiv:1607.05368. 2016.
18. Le Q., Mikolov T. Distributed representations of sentences and documents // International Conference on Machine Learning. 2014. Pp. 1188-1196.
19. Reimers N., Gurevych I. Sentence-BERT: Sentence embeddings using Siamese BERT-networks // arXiv preprint arXiv:1908.10084. 2019.
20. Chen W. et al. How large a vocabulary does text classification need? A variational approach to vocabulary selection // arXiv preprint arXiv:1902.10339. 2019.
21. Chirkova N., Lobacheva E., Vetrov D. Bayesian compression for natural language processing // arXiv preprint arXiv:1810.10927. 2018.
22. Hofmann T. Probabilistic latent semantic indexing // Proceedings of the Twenty-Second Annual International SIGIR Conference on Research and Development in Information Retrieval. 1999. Pp. 50-57.
23. Blei D. M., Ng A. Y., Jordan M. I. Latent Dirichlet allocation // Journal of Machine Learning Research. 2003. Vol. 3. Pp. 993-1022.
24. Moon T. K. The expectation-maximization algorithm // IEEE Signal Processing Magazine. 1996. Vol. 13. No. 6. Pp. 47-60.
25. Vorontsov K. V., Potapenko A. A. Modifikacii EM-algoritma dlja verojatnostnogo tematicheskogo modelirovanija [EM-like algorithms for probabilistic topic modeling] // Mashinnoe obuchenie i analiz dannyh [Machine Learning and Data Mining]. 2013. Vol. 1. No. 6. Pp. 657-686.
26. Karpovich S. N. Mnogoznachnaja klassifikacija tekstovykh dokumentov s ispol'zovaniem verojatnostnogo tematicheskogo modelirovanija ml-PLSI [Multi-label classification of text documents using the probabilistic topic model ml-PLSI] // Trudy SPIIRAN [SPIIRAS Proceedings]. SPb., 2016. Vol. 4. No. 47. Pp. 92-104.
27. Vorontsov K., Potapenko A. Additive regularization of topic models // Machine Learning. 2015. Vol. 101. No. 1-3. Pp. 303-323.
28. Pedregosa F. et al. Scikit-learn: Machine learning in Python // Journal of Machine Learning Research. 2011. Vol. 12. Pp. 2825-2830.
29. Karpovich S. N. Russkojazychnyj korpus tekstov SCTM-ru dlja postroenija tematicheskikh modelej [The Russian-language text corpus SCTM-ru for building topic models] // Trudy SPIIRAN [SPIIRAS Proceedings]. SPb., 2015. No. 39. Pp. 123-142.
30. Ianina A., Vorontsov K. Regularized multimodal hierarchical topic model for document-by-document exploratory search // 2019 25th Conference of Open Innovations Association (FRUCT). IEEE, 2019. Pp. 131-138.
31. Vorontsov K. et al. Non-Bayesian additive regularization for multimodal topic modeling of large collections // Proceedings of the 2015 Workshop on Topic Models: Post-Processing and Applications. 2015. Pp. 29-37.