Journal "Information Technologies and Computational Systems" - N. Z. Valishina, S. A. Ilyuhin, A. V. Sheshkus, V. L. Arlazarov "Automatic Training Data Filtering for Errors Removing and Improving the Quality of the Final Neural Network"

Issue 2022 / 03

Real-world data are often dirty, and in most cases this reduces the accuracy of a model trained on them. Supervised data correction is expensive and time-consuming, so one way to address the problem is to automate the cleaning process. In this paper, we consider automatic cleaning of training data as a preprocessing technique for improving the quality of the trained network. The proposed iterative method is based on the assumption that polluted data are most likely located farther from the class median. It detects noisy data and then removes them from the training set. Experiments on a generated synthetic dataset show that the method cleans the data even at high pollution levels and significantly improves the quality of the classifier.
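The core assumption, that polluted samples lie farther from the class median, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the feature space, the distance measure, the removal rule, or the stopping criterion. The sketch assumes precomputed feature vectors, the Euclidean distance to a coordinate-wise median, and two hypothetical parameters, `keep_fraction` and `n_iters`. The features stay fixed between iterations here, which may differ from the paper's procedure.

```python
import numpy as np


def filter_by_class_median(features, labels, keep_fraction=0.9, n_iters=3):
    """Iteratively drop, per class, the samples farthest from the class median.

    Hypothetical sketch: keep_fraction and n_iters are illustrative
    parameters, not values taken from the paper.
    Returns a boolean mask over the input samples (True = kept).
    """
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    keep = np.ones(len(labels), dtype=bool)

    for _ in range(n_iters):
        for cls in np.unique(labels):
            # Indices of samples of this class that are still kept.
            idx = np.flatnonzero(keep & (labels == cls))
            if len(idx) < 3:
                continue  # too few samples to estimate a median

            # Coordinate-wise median of the remaining class samples
            # (an assumption; other robust centers are possible).
            median = np.median(features[idx], axis=0)
            dist = np.linalg.norm(features[idx] - median, axis=1)

            # Keep the closest keep_fraction of the class, drop the rest.
            n_keep = max(2, int(np.ceil(keep_fraction * len(idx))))
            drop = idx[np.argsort(dist)[n_keep:]]
            keep[drop] = False

    return keep
```

Calling `filter_by_class_median(X, y)` on an `(n_samples, n_features)` array `X` and a label array `y` returns a mask, so `X[mask], y[mask]` is the filtered training set. A fixed `keep_fraction` also removes clean samples from classes that contain no noise, so a real pipeline would likely need a data-dependent threshold.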

data cleaning, outlier(s) detection, mislabels, classifier, siamese neural network.

DOI 10.14357/20718632220304
