Abstract.
Real-world data are often dirty, which in most cases degrades the accuracy of models trained on them. Supervised data correction is an expensive and time-consuming procedure, so one possible way to address this problem is to automate the cleaning process. In this paper, we consider automatic cleaning of training data as a preprocessing technique for improving the quality of the trained network. The proposed iterative method is based on the assumption that polluted samples are most likely located far from the median of their class. It detects and removes the noisy samples from the training set. Experiments on a generated synthetic dataset demonstrate that this method cleans the data even at high pollution levels and significantly improves the quality of the classifier.
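The iterative detect-and-remove scheme described above can be sketched as follows. This is a minimal illustration, not the paper's exact procedure: the feature representation, the MAD-based robust threshold, and the `clean_dataset` helper are all assumptions made for the example.

```python
import numpy as np

def clean_dataset(X, y, k=3.5, max_iter=10):
    """Iteratively remove samples lying far from their class median.

    Each sample's distance to the median of its class is compared
    against a robust threshold (median distance + k * MAD); samples
    beyond it are treated as noise and dropped, and the process
    repeats until no sample is removed or max_iter is reached.
    """
    keep = np.ones(len(X), dtype=bool)
    for _ in range(max_iter):
        removed = False
        for c in np.unique(y[keep]):
            idx = np.where(keep & (y == c))[0]
            med = np.median(X[idx], axis=0)            # per-class median
            d = np.linalg.norm(X[idx] - med, axis=1)   # distances to it
            mad = np.median(np.abs(d - np.median(d)))  # robust spread
            thr = np.median(d) + k * max(mad, 1e-12)
            noisy = idx[d > thr]
            if len(noisy) > 0:
                keep[noisy] = False
                removed = True
        if not removed:
            break
    return X[keep], y[keep]
```

The MAD (median absolute deviation) is used instead of the standard deviation because the noisy samples themselves would inflate a non-robust spread estimate; in a real pipeline the classifier would be retrained on the cleaned set after each pass.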
Keywords:
data cleaning, outlier detection, mislabeled data, classifier, siamese neural network.
PP. 35–42.
DOI 10.14357/20718632220304
