A Review of Missing Data Handling Techniques for Machine Learning

Authors

  • Luke Oluwaseye Joel, Institute for Intelligent Systems, University of Johannesburg, Johannesburg, South Africa
  • Wesley Doorsamy, Institute for Intelligent Systems, University of Johannesburg, Johannesburg, South Africa
  • Babu Sena Paul, Institute for Intelligent Systems, University of Johannesburg, Johannesburg, South Africa

DOI:

https://doi.org/10.15157/IJITIS.2022.5.3.971-1005

Keywords:

Machine learning; Data; Missing Data; Techniques; Classification model

Abstract

Real-world data commonly contain missing values, which adversely affect the performance of most machine learning algorithms applied to such datasets. Missing values are among the many challenges that arise in real-world data. Since the accuracy and efficiency of machine learning models depend on the quality of the data used, data analysts and researchers need relevant techniques for handling these unavoidable missing values. This paper reviews state-of-the-art practices from the literature for handling missing data in machine learning and lists evaluation metrics used to measure the performance of these techniques. The study presents the techniques and evaluation metrics in clear terms, supported by mathematical equations. Finally, it provides recommendations to consider when applying missing data handling techniques.

Published

2022-09-08

How to Cite

Luke Oluwaseye Joel, Wesley Doorsamy, & Babu Sena Paul. (2022). A Review of Missing Data Handling Techniques for Machine Learning. International Journal of Innovative Technology and Interdisciplinary Sciences, 5(3), 971–1005. https://doi.org/10.15157/IJITIS.2022.5.3.971-1005