Missing Values in Data and How to Handle Them
Missing values are among the most common problems encountered in data analysis. They occur when a value for a particular variable is unavailable in one or more records. Examples include failing to record a customer’s age, a participant not answering a survey question, or a measuring device malfunctioning during an experiment.
An empty cell may appear to be a simple problem, but handling it incorrectly can lead to inaccurate results, the loss of a large amount of data, or spurious relationships between variables.
What Are Missing Values?
A missing value is a value that should have been recorded but is not available in the dataset. Missing values may appear in different forms in software, such as:
An empty cell.
The symbol NA.
The symbol NULL.
A question mark.
Sentinel codes such as 999 or -1, when the person who collected the data used them to indicate that no response was provided.
It is important to confirm that numbers such as 999 are not genuine values before treating them as missing data.
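As a minimal sketch of this check, the forms listed above can be mapped to a single missing marker before any analysis. The marker sets, the column names, and the `normalise` helper are hypothetical; the sentinel codes in particular are declared per column, on the assumption that the codebook confirms they mean "no response" for that column only.

```python
# Hypothetical marker sets; adjust them to the codebook of the dataset at hand.
TEXT_MARKERS = {"", "NA", "NULL", "?"}
SENTINEL_CODES = {"age": {999, -1}}  # per-column codes confirmed as "no response"


def normalise(column, raw):
    """Return None for a missing entry, otherwise the value unchanged."""
    if raw is None:
        return None
    if isinstance(raw, str) and raw.strip().upper() in TEXT_MARKERS:
        return None
    if raw in SENTINEL_CODES.get(column, set()):
        return None
    return raw


records = [
    {"age": 34, "city": "Leeds"},
    {"age": 999, "city": "NA"},
    {"age": -1, "city": "?"},
    {"age": 51, "city": ""},
]

cleaned = [{col: normalise(col, val) for col, val in row.items()} for row in records]
for row in cleaned:
    print(row)
```

Because the codes are attached to a column, a value of 999 in a column where it could be genuine (an income, for instance) is left untouched.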
Causes of Missing Values
Missing values may occur for several reasons, including:
A person refuses to answer a particular question.
Some information is accidentally not entered.
An error occurs during data transfer.
A measuring device malfunctions.
Part of a file is lost.
A question does not apply to certain participants.
Some participants withdraw before the study is completed.
Therefore, the correct first step is not to delete the missing values immediately, but to try to understand why they are missing.
Types of Missing Data
First: Missing Completely at Random (MCAR)
This occurs when the probability that a value is missing is unrelated to any variable in the dataset, whether observed or unobserved.
Example: Some questionnaires are lost because of random damage to a storage device.
In this case, deleting a small number of records may be acceptable because the records with missing values do not systematically differ from the remaining records.
Second: Missing at Random (MAR)
This occurs when the probability that a value is missing depends only on other information that is observed in the dataset, and not on the missing value itself.
Example: Older people may be less likely to respond to an online survey. In this case, the missing responses are related to age, which is an observed variable.
Methods such as multiple imputation or statistical models that use the available variables can be applied in this situation.
Third: Missing Not at Random (MNAR)
This occurs when the probability that a value is missing is related to the missing value itself or to a factor that has not been recorded in the dataset.
Example: People with very low or very high incomes may refuse to report their income. Therefore, the probability that income is missing depends on the income value itself.
This is the most difficult type of missing data because the available information alone may not be sufficient to estimate the missing values reliably. The researcher must investigate the reason for the missingness, make different assumptions, and conduct a sensitivity analysis.
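The difference between the three types can be illustrated with a small simulation. The sketch below uses an entirely synthetic population in which income rises with age; the probabilities, the income threshold, and the function names are assumptions chosen for illustration only. It compares the mean of the observed incomes with the true mean under each mechanism, and for the MAR case it also shows that reweighting by the observed age groups recovers the true mean.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Synthetic population: income rises with age, plus noise.
people = []
for _ in range(20000):
    age = random.uniform(20, 80)
    people.append((age, 1000 + 40 * age + random.gauss(0, 300)))


def p_mcar(age, income):
    return 0.3  # the same probability for everyone


def p_mar(age, income):
    return 0.1 if age < 50 else 0.5  # depends only on the observed age


def p_mnar(age, income):
    return 0.5 if income > 3500 else 0.1  # depends on the income value itself


def mask(p_missing):
    """Return (age, income-or-None) pairs after applying a missingness rule."""
    return [(age, None if random.random() < p_missing(age, inc) else inc)
            for age, inc in people]


def mean_observed(rows):
    vals = [inc for _, inc in rows if inc is not None]
    return sum(vals) / len(vals)


true_mean = sum(inc for _, inc in people) / len(people)
mcar_mean = mean_observed(mask(p_mcar))
mar_rows = mask(p_mar)
mar_mean = mean_observed(mar_rows)
mnar_mean = mean_observed(mask(p_mnar))

# Under MAR the observed variable explains the missingness, so group means
# weighted by each group's share of the full sample remove the bias.
young = [r for r in mar_rows if r[0] < 50]
old = [r for r in mar_rows if r[0] >= 50]
mar_adjusted = (len(young) * mean_observed(young)
                + len(old) * mean_observed(old)) / len(mar_rows)

for name, value in [("true", true_mean), ("MCAR", mcar_mean), ("MAR", mar_mean),
                    ("MAR adjusted", mar_adjusted), ("MNAR", mnar_mean)]:
    print(f"{name:>12}: {value:8.1f}")
```

No comparable adjustment is available in the MNAR case, because the variable that drives the missingness is the one that is missing.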
How Is the Percentage of Missing Values Calculated?
The percentage of missing values for each variable can be calculated using the following formula:
Percentage of missing values = (Number of missing values ÷ Total number of records) × 100
For example, suppose a dataset contains 1,000 records and the age value is missing in 80 of them. The percentage of missing values in the age variable is therefore 80 ÷ 1,000 × 100 = 8%.
It is preferable to calculate the percentage of missing values for every column and every row rather than calculating only one percentage for the entire dataset. One column may contain a very high percentage of missing values, while the other columns may be complete.
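A minimal sketch of the per-column and per-row calculation follows. It assumes the records are held as a list of dictionaries with missing entries already stored as `None`; the sample rows and function names are made up for illustration.

```python
rows = [
    {"age": 34,   "income": 2500, "city": "Leeds"},
    {"age": None, "income": 3100, "city": "York"},
    {"age": 45,   "income": None, "city": None},
    {"age": 29,   "income": None, "city": "Bath"},
]


def pct_missing_by_column(rows):
    """Percentage of missing values in each column."""
    return {col: 100 * sum(r[col] is None for r in rows) / len(rows)
            for col in rows[0]}


def pct_missing_by_row(rows):
    """Percentage of missing values in each record."""
    return [100 * sum(v is None for v in r.values()) / len(r) for r in rows]


print(pct_missing_by_column(rows))
print(pct_missing_by_row(rows))

# The worked example from the text: 80 missing ages out of 1,000 records.
print(100 * 80 / 1000)
```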
The Appropriate Decision Based on the Percentage of Missing Values
There is no single scientific percentage that is suitable for every project. However, the following guidelines may be used as a starting point.
Less Than 5%
This is generally considered a low percentage.
If the data are missing completely at random and the remaining number of records is sufficient, the records containing missing values may be deleted.
However, records should not be deleted automatically when the variable is highly important or when the missing records belong to a particular group.
From 5% to 20%
This is considered a moderate percentage.
Deleting all incomplete records is generally not recommended because it may result in the loss of a significant part of the sample. The following methods may be used:
The median for skewed numerical variables in simple analyses.
The mode for categorical variables.
Imputation using regression or the k-nearest neighbours method.
Multiple imputation in rigorous statistical studies.
Replacing missing values with the mean without careful consideration is not recommended because it reduces variability and may alter the relationships between variables.
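The loss of variability can be seen in a small made-up example. The sketch below fills three missing entries with the mean of the observed values and compares the standard deviation before and after; the numbers are illustrative only.

```python
import statistics

values = [12.0, 15.0, None, 20.0, None, 31.0, 22.0, None]

observed = [v for v in values if v is not None]
mean = statistics.mean(observed)

# Every missing entry receives the same value, exactly at the centre.
filled = [mean if v is None else v for v in values]

sd_before = statistics.stdev(observed)
sd_after = statistics.stdev(filled)

print(f"mean is unchanged: {statistics.mean(filled):.2f}")
print(f"standard deviation before: {sd_before:.2f}, after: {sd_after:.2f}")
```

The mean stays the same, but the filled-in values add no spread, so the standard deviation of the completed column is smaller than that of the observed values.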
From 20% to 40%
This is considered a high percentage.
Deleting records becomes riskier because it may result in a small or biased sample. Multiple imputation or maximum-likelihood methods are generally preferable. Auxiliary variables related to the missing variable or to the reason for its missingness should also be included.
The results should be compared using more than one method, and a sensitivity analysis should be conducted.
From 40% to 60%
This is considered a very high percentage.
The importance of the variable must be evaluated:
If the variable is not essential and suitable alternative variables are available, deleting it may be the most appropriate decision.
If the variable is necessary, the researcher should search for another source of data or attempt to collect the data again.
If this is not possible, advanced models may be used, but the results should clearly indicate that there is a high level of uncertainty.
More Than 60%
This situation is generally considered critical.
In many projects, deleting the variable may be safer when it is not essential because most of its information is unavailable. However, if the variable is the main subject or outcome of the study, deleting it is not an appropriate solution. It may be necessary to collect the data again or redesign the study.
This does not mean that a variable with more than 60% missing values can never be used. The remaining information may still be valuable, but the decision must be supported by a clear scientific justification.
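The bands above can be written as a small lookup, offered only as a starting point in the same spirit as the guidelines themselves. The function name is hypothetical, and because the text gives shared boundaries (20% and 40% each appear in two bands), the sketch assumes a boundary value belongs to the higher band, except 60%, which stays in the 40% to 60% band since the last band is "more than 60%".

```python
def guideline_band(pct_missing):
    """Return (band, suggested starting point) for a percentage of missing values."""
    if not 0 <= pct_missing <= 100:
        raise ValueError("pct_missing must be between 0 and 100")
    if pct_missing < 5:
        return "low", "row deletion may be acceptable if MCAR and enough records remain"
    if pct_missing < 20:  # assumption: 5 itself counts as moderate
        return "moderate", "median or mode, regression or k-NN imputation, or multiple imputation"
    if pct_missing < 40:  # assumption: 20 itself counts as high
        return "high", "multiple imputation or maximum likelihood, with a sensitivity analysis"
    if pct_missing <= 60:
        return "very high", "weigh the variable's importance and look for other data sources"
    return "critical", "consider dropping a non-essential variable or collecting the data again"


for pct in (3, 8, 25, 50, 75):
    band, advice = guideline_band(pct)
    print(f"{pct:>3}% -> {band}: {advice}")
```

The returned advice is deliberately generic; it does not replace the checks on the cause of missingness and the importance of the variable.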
Why Is the Percentage Alone Not Enough to Make a Decision?
A missing-data rate of 10% in one situation may be more dangerous than a rate of 40% in another.
For example, if 10% of income values are missing only among people with high incomes, deleting these records will produce an artificially low average income.
In contrast, losing 40% of the values of a secondary variable may be less serious if the missingness is random and other variables are available to help predict its values.
Therefore, the decision depends on five main factors:
The percentage of missing values.
The cause and type of missingness.
The importance of the variable.
The amount of data that remains available.
The purpose of the analysis, whether descriptive, predictive, or inferential.
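The income example above can be checked with a short calculation. The ten incomes below are invented for illustration; the highest one (10% of the records) is treated as missing, and the mean of the remaining records is compared with the true mean.

```python
incomes = [1800, 2100, 2300, 2600, 2900, 3200, 3600, 4100, 5200, 9500]

true_mean = sum(incomes) / len(incomes)

# Only the single highest income (10% of the records) is unavailable.
observed = sorted(incomes)[:-1]
observed_mean = sum(observed) / len(observed)

print(f"true mean:     {true_mean:.1f}")
print(f"observed mean: {observed_mean:.1f}")
print(f"shortfall:     {true_mean - observed_mean:.1f}")
```

Although only one record in ten is affected, the average computed from the remaining records is noticeably lower than the true average, because the missingness is concentrated at the top of the distribution.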
Main Methods for Handling Missing Values
Deleting Rows
This method is suitable when the percentage of missing values is low, the data are missing completely at random, and the number of remaining records is sufficiently large.
Its disadvantage is that it reduces the sample size and may introduce bias when the missingness is not completely random.
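A minimal sketch of row deletion, assuming records are dictionaries with missing entries stored as `None`. The sample data and the function name are illustrative.

```python
rows = [
    {"age": 34,   "income": 2500},
    {"age": None, "income": 3100},
    {"age": 45,   "income": None},
    {"age": 29,   "income": 2700},
]


def drop_incomplete(rows):
    """Keep only the records in which every variable is present."""
    return [r for r in rows if all(v is not None for v in r.values())]


complete = drop_incomplete(rows)
print(f"{len(complete)} of {len(rows)} records remain")
```

Here two scattered missing values remove half of the sample, which is the loss of sample size described above.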
Deleting the Variable
This method may be used when the percentage of missing values is extremely high, the variable is not essential, and suitable alternative variables are available.
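The sketch below drops columns whose percentage of missing values exceeds a chosen threshold, while protecting variables named as essential. The threshold, the column names, and the `essential` parameter are assumptions made for illustration.

```python
rows = [
    {"age": 34, "fax": None,   "outcome": None},
    {"age": 41, "fax": None,   "outcome": 1},
    {"age": 45, "fax": None,   "outcome": None},
    {"age": 29, "fax": "5512", "outcome": None},
    {"age": 52, "fax": None,   "outcome": None},
]


def drop_sparse_columns(rows, max_pct_missing, essential=()):
    """Remove non-essential columns with more than max_pct_missing % missing."""
    n = len(rows)
    to_drop = {
        col for col in rows[0]
        if col not in essential
        and 100 * sum(r[col] is None for r in rows) / n > max_pct_missing
    }
    return [{c: v for c, v in r.items() if c not in to_drop} for r in rows]


reduced = drop_sparse_columns(rows, max_pct_missing=60, essential=("outcome",))
print(list(reduced[0]))
```

Both `fax` and `outcome` are 80% missing, but only `fax` is removed because `outcome` is marked as essential and needs a different remedy.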
Imputation Using the Mean, Median, or Mode
This is a simple and quick method, but it does not represent the uncertainty associated with the missing values.
The median is usually more appropriate than the mean when the data contain outliers or are highly skewed. The mode is suitable for categorical variables. However, these methods are generally not considered the best options for advanced statistical studies.
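A minimal sketch of median and mode imputation using Python's standard `statistics` module. The data are invented, with one extreme income included to show why the median is preferred for skewed values.

```python
import statistics


def impute_median(values):
    """Replace None with the median of the observed numeric values."""
    observed = [v for v in values if v is not None]
    fill = statistics.median(observed)
    return [fill if v is None else v for v in values]


def impute_mode(values):
    """Replace None with the most frequent observed category."""
    observed = [v for v in values if v is not None]
    fill = statistics.mode(observed)
    return [fill if v is None else v for v in values]


incomes = [2000, 2200, None, 2500, 50000, None]
colours = ["red", "blue", None, "red", None, "green"]

print(impute_median(incomes))
print(impute_mode(colours))

# For comparison: the mean is pulled far upwards by the single outlier.
print(statistics.mean(v for v in incomes if v is not None))
```

The median fill value is 2350, close to the typical incomes, whereas the mean of the observed values is 14175.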
Model-Based Imputation
The missing value is predicted using other variables through methods such as regression, decision trees, or the k-nearest neighbours algorithm.
This approach may be useful in machine-learning projects, but the quality of the imputed values depends on the strength of the relationships between the variables.
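As one simple instance of this idea, the sketch below fits a straight line of income on age using the complete records and uses it to predict the missing incomes. The data are invented and lie exactly on a line to keep the arithmetic easy to follow; real data would not, and the helper names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# (age, income) pairs; income is None where it was not recorded.
rows = [(25, 2000.0), (30, 2300.0), (35, None), (40, 2900.0), (50, 3500.0), (45, None)]

complete = [(age, inc) for age, inc in rows if inc is not None]
slope, intercept = fit_line([a for a, _ in complete], [i for _, i in complete])

# Predict only where the income is missing; observed values are left as they are.
imputed = [(age, inc if inc is not None else intercept + slope * age)
           for age, inc in rows]

for age, inc in imputed:
    print(age, round(inc, 1))
```

Each imputed value sits exactly on the fitted line, so this kind of single imputation still understates the natural scatter of the data when the relationship is weak.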
Multiple Imputation
Multiple imputation creates several versions of the dataset. In each version, different plausible values are inserted in place of the missing values. All versions are then analysed, and the results are combined.
The main advantage of this method is that it takes uncertainty into account instead of treating a single predicted value as if it were certainly correct.
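The create, analyse, and combine steps can be sketched for the simplest possible case: estimating a mean from one variable. This is a simplified illustration, not a full implementation. It assumes the values are missing completely at random, draws each imputation from a bootstrap resample of the observed values (an approach sometimes called the approximate Bayesian bootstrap), and combines the results with Rubin's rules. The data and the number of imputed datasets are arbitrary.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

values = [23.0, 27.0, None, 31.0, 35.0, None, 29.0, 40.0, None, 33.0, 26.0, 38.0]
observed = [v for v in values if v is not None]
n = len(values)
m = 20  # number of imputed datasets (an arbitrary choice)

estimates, within_vars = [], []
for _ in range(m):
    # Resample the observed values first, so that uncertainty about their
    # distribution carries through to the imputed values.
    donor_pool = random.choices(observed, k=len(observed))
    completed = [random.choice(donor_pool) if v is None else v for v in values]

    # Analyse each completed dataset: here, the mean and its squared standard error.
    estimates.append(statistics.mean(completed))
    within_vars.append(statistics.variance(completed) / n)

# Combine with Rubin's rules.
pooled_estimate = statistics.mean(estimates)
within = statistics.mean(within_vars)       # average within-imputation variance
between = statistics.variance(estimates)    # variance between the m estimates
total_variance = within + (1 + 1 / m) * between

print(f"pooled mean: {pooled_estimate:.2f}")
print(f"standard error: {total_variance ** 0.5:.2f}")
```

The between-imputation term is what a single imputed dataset cannot provide: it widens the standard error to reflect that the missing values are not actually known.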
Sensitivity Analysis
Sensitivity analysis is particularly useful when the data are suspected to be missing not at random. It involves testing different assumptions about the missing values and examining whether the final results change significantly.
If the results change substantially when the assumptions are changed, the findings should be interpreted with caution.
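One simple way to carry out such a check is to shift the imputed values by different amounts and watch the result. The sketch below assumes, in turn, that the unreported incomes are on average lower than, equal to, or higher than the observed mean by a chosen amount; the data and the shift values are invented for illustration.

```python
import statistics

incomes = [2100.0, 2500.0, None, 3000.0, None, 2800.0, 3400.0, None, 2600.0, 3200.0]

observed = [v for v in incomes if v is not None]
base = statistics.mean(observed)


def mean_under_shift(delta):
    """Overall mean if each missing income equalled the observed mean plus delta."""
    completed = [base + delta if v is None else v for v in incomes]
    return statistics.mean(completed)


for delta in (-1000, 0, 1000, 3000):
    print(f"shift {delta:>6}: overall mean {mean_under_shift(delta):.1f}")
```

With three of ten values missing, each unit of shift moves the overall mean by 0.3 units, so the conclusion about average income depends visibly on what is assumed about the people who did not answer.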