D. R. Potapov Implementation of the module for determining complex load parameters for self-adapting data containers


In applications that work with a large amount of static data, or with data that is mostly read, caching greatly improves performance. To achieve maximum efficiency, an adaptive data storage implementation can change the cache size dynamically during execution, based on the container load and on the difference in speed between the main container and the cache. The main parameter of the load is the set of requested data, which in the general case can be described by a Gaussian distribution. In practice, however, the container load is usually a combination of several simple loads, because requests to the data storage can come from many applications or from different tasks. The parameters of these loads must therefore be identified to maximize cache efficiency. This paper presents the results of implementing a module that determines complex load parameters for self-adapting data containers. It explains the choice of the EM modification and of k-means++ initialization, and briefly describes the module structure. Clustering quality (for one and for many clusters, under concept drift, and with a time frame) and module execution time are analyzed. The test results indicate that the module determines complex load parameters well enough to be used effectively in self-adapting data containers.
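The abstract names EM with k-means++ initialization for separating a complex load into simple Gaussian loads, but gives no implementation details. The following is a minimal illustrative sketch only, not the module described in the paper. It assumes one-dimensional numeric request keys and a known number of components, and it uses plain batch EM rather than the specific EM modification chosen in the paper. All function and parameter names (`kmeanspp_seeds`, `fit_mixture`, `min_var`) are hypothetical.

```python
import math
import random


def kmeanspp_seeds(keys, k, rng):
    """Pick k initial centers with k-means++ seeding (D^2 weighting)."""
    centers = [rng.choice(keys)]
    while len(centers) < k:
        # Squared distance from each key to its nearest already-chosen center.
        d2 = [min((x - c) ** 2 for c in centers) for x in keys]
        if sum(d2) == 0.0:
            # Every key coincides with a center: fall back to a uniform pick.
            centers.append(rng.choice(keys))
        else:
            centers.append(rng.choices(keys, weights=d2, k=1)[0])
    return centers


def normal_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_mixture(keys, k, iterations=50, seed=0, min_var=1e-6):
    """Batch EM for a 1-D Gaussian mixture.

    Returns a list of (weight, mean, variance) tuples sorted by mean.
    """
    rng = random.Random(seed)
    n = len(keys)
    means = kmeanspp_seeds(keys, k, rng)
    overall_mean = sum(keys) / n
    overall_var = max(sum((x - overall_mean) ** 2 for x in keys) / n, min_var)
    variances = [overall_var] * k   # start every component with the global variance
    weights = [1.0 / k] * k         # and with equal mixing weights

    for _ in range(iterations):
        # E-step: responsibility of each component for each requested key.
        resp = []
        for x in keys:
            p = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            s = sum(p)
            resp.append([pj / s for pj in p] if s > 0.0 else [1.0 / k] * k)

        # M-step: re-estimate weight, mean and variance of each component.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj < 1e-12:
                continue  # empty component: keep its previous parameters
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, keys)) / nj
            var_j = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, keys)) / nj
            variances[j] = max(var_j, min_var)  # guard against collapse to zero

    return sorted(zip(weights, means, variances), key=lambda c: c[1])


if __name__ == "__main__":
    # Synthetic complex load: two simple loads hitting different key ranges.
    gen = random.Random(42)
    load = [gen.gauss(100.0, 5.0) for _ in range(300)]
    load += [gen.gauss(500.0, 20.0) for _ in range(300)]
    for weight, mean, var in fit_mixture(load, k=2):
        print(f"weight={weight:.2f} mean={mean:.1f} std={math.sqrt(var):.1f}")
```

In a self-adapting container, the estimated means and variances would indicate which key ranges are requested most often, and the weights would indicate how much of the total load each range accounts for. A streaming setting with concept drift, as analyzed in the paper, would need an online or windowed variant rather than this batch version.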


data storage, cache efficiency, optimal data storage, adaptive data container, container load, Gaussian mixture model, clustering, EM, k-means.

PP. 87-95.

DOI 10.14357/20718632190108


