INTELLIGENCE SYSTEMS AND TECHNOLOGIES
V. N. Gridin, V. I. Solodovnikov, D. S. Smirnov, V. P. Kolb, P. V. Bochkaryov, I. A. Kuznetcov. Study of the Relationship between Data Generation Indicators and the Effectiveness of Solving Target Problems
Abstract. 

Artificial intelligence (AI) technologies are becoming an integral part of everyday life, which makes the shortage of training data an increasingly pressing problem. Limited access to real data, caused by privacy restrictions and by the scarcity of information itself, hinders the development of AI and machine learning systems. In recent years, so-called “trusted AI” systems, which focus on safety, reliability, and ethics, have also been actively developed. These systems aim to address algorithmic bias and opacity by providing explanations for their decisions. Synthetic data has emerged as a response to the data shortage: it allows AI models to be trained on artificially generated but realistic data. This approach helps to overcome the difficulties caused by the lack of real data and contributes to more effective and less biased AI models. This paper examines whether indicators of data generation quality can serve as a predictor of how effectively the generated data solves downstream machine learning tasks.

Keywords: 

synthetic data, regression, classification methods, GAN, VAE, correlation analysis.

DOI 10.14357/20718632250105 

EDN QDVKHQ

PP. 53-64.
