Missing data is prevalent in real-world datasets, and values can be missing for various reasons. Fortunately, both pandas and scikit-learn offer several imputation tools to deal with it. Pandas offers a basic yet powerful interface for univariate imputation using fillna and more advanced functionality using interpolate. scikit-learn offers SimpleImputer for univariate imputation and KNNImputer and IterativeImputer for multivariate imputation. In this post, we will focus on fillna and SimpleImputer and compare them.
Basic Functionality
SimpleImputer offers four strategies to fill in the NaN values – mean, median, most_frequent, and constant.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
df = pd.DataFrame(
[[7, 2, np.nan], [4, np.nan, 6], [10, 5, 9]])
imp_mean = SimpleImputer(strategy='mean')
pd.DataFrame(imp_mean.fit_transform(df))
output –
0 1 2
0 7.0 2.0 7.5
1 4.0 3.5 6.0
2 10.0 5.0 9.0
Can we achieve the same with pandas? Yes!
df.fillna(df.mean())
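To check that the two approaches really agree, here is a minimal sketch that runs both on the same frame and compares the values. The variable names (sk_result, pd_result, same) are mine and only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame([[7, 2, np.nan], [4, np.nan, 6], [10, 5, 9]])

# scikit-learn returns a plain NumPy array, so wrap it back into a DataFrame
sk_result = pd.DataFrame(
    SimpleImputer(strategy="mean").fit_transform(df),
    index=df.index, columns=df.columns)

# pandas keeps the index and the column labels
pd_result = df.fillna(df.mean())

# Compare values only: column 0 stays integer in pandas but becomes float in scikit-learn
same = np.allclose(sk_result.to_numpy(dtype=float),
                   pd_result.to_numpy(dtype=float))
print(same)  # prints True
```

The values match, but the dtypes do not: fillna leaves the untouched integer column as it is, while SimpleImputer converts everything to float.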
Want to impute with the most frequent value?
Assume – df = pd.DataFrame(['a', 'a', 'b', np.nan])
With SimpleImputer –
imp_mode = SimpleImputer(
strategy='most_frequent')
pd.DataFrame(
imp_mode.fit_transform(df))
With fillna –
df.fillna(df.mode().iloc[0])
And the output of both –
0
0 a
1 a
2 b
3 a
Different Strategies
Want to apply different strategies for different columns? Using scikit-learn, you will need several imputers, one per strategy. Using fillna you can pass a dictionary, for example –
df = pd.DataFrame(
[[7, 2, np.nan], [4, np.nan, 6], [10, 5, 9]])
df.fillna({1: 10000, 2: df[2].mean()})
0 1 2
0 7 2.0 7.5
1 4 10000.0 6.0
2 10 5.0 9.0
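For comparison, here is one way the scikit-learn side could look. This is a sketch, not the only option: it wires two SimpleImputer instances together with ColumnTransformer, and the string column names a, b, c are hypothetical (I renamed the columns to avoid any ambiguity between integer labels and positions).

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

df = pd.DataFrame([[7, 2, np.nan], [4, np.nan, 6], [10, 5, 9]])
df.columns = ["a", "b", "c"]  # hypothetical names, not in the original example

# One imputer per strategy, each bound to the columns it should handle
ct = ColumnTransformer(
    [("const", SimpleImputer(strategy="constant", fill_value=10000), ["b"]),
     ("mean", SimpleImputer(strategy="mean"), ["c"])],
    remainder="passthrough")  # keep column "a" untouched

# The result is a NumPy array ordered as b, c, a (transformed columns first)
out = ct.fit_transform(df)
print(out)
```

Note that the output columns are reordered: the imputed columns come first and the passthrough column last, which is one more thing to track compared to the fillna dictionary.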
Advanced Usage
Want to impute values drawn from a normal distribution? No problem –
mean = 5
scale = 2
df = pd.DataFrame(
[[7, 2, np.nan], [4, np.nan, 6], [10, 5, 9]])
df.fillna(
pd.DataFrame(
np.random.normal(mean, scale, df.shape)))
0 1 2
0 7 2.000000 3.857513
1 4 5.407452 6.000000
2 10 5.000000 9.000000
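The imputed values above change on every run. If you need reproducible results, a seeded generator helps. The sketch below assumes NumPy's default_rng; the seed 0 and the names draws and filled are arbitrary choices of mine.

```python
import numpy as np
import pandas as pd

mean = 5
scale = 2
rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility

df = pd.DataFrame([[7, 2, np.nan], [4, np.nan, 6], [10, 5, 9]])

# One candidate draw per cell, aligned with df by index and column labels
draws = pd.DataFrame(rng.normal(mean, scale, df.shape),
                     index=df.index, columns=df.columns)

# fillna only uses the draws at positions where df is missing
filled = df.fillna(draws)
print(filled)
```

Passing the index and columns explicitly matters when the frame does not use the default 0..n labels, because fillna aligns the two frames by label rather than by position.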
Missing indicator
Using SimpleImputer, one can add indicator columns that take the value 1 if the original value was missing, and 0 otherwise. This can also be done using MissingIndicator.
df = pd.DataFrame(
[[7, 2, np.nan], [4, np.nan, 6], [10, 5, 9]])
mean_imp = SimpleImputer(strategy='mean', add_indicator=True)
pd.DataFrame(mean_imp.fit_transform(df))
0 1 2 3 4
0 7.0 2.0 7.5 0.0 1.0
1 4.0 3.5 6.0 1.0 0.0
2 10.0 5.0 9.0 0.0 0.0
Note that indicator columns (i.e., columns 3 and 4 in the example above) are added only for columns that contain missing values. Therefore, there is no indicator column corresponding to column 0. If you are converting back and forth to pandas DataFrames, you should note this nuance.
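One way to keep track of which indicator belongs to which column is to read the fitted imputer's indicator_.features_ attribute, which holds the positions of the input columns that received an indicator. The sketch below uses it to rebuild labelled columns; the "_ind" suffix is my own naming choice, and the code assumes that no all-missing columns were dropped.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame([[7, 2, np.nan], [4, np.nan, 6], [10, 5, 9]])

mean_imp = SimpleImputer(strategy="mean", add_indicator=True)
values = mean_imp.fit_transform(df)

# Positions of the input columns that had missing values during fit
flagged = mean_imp.indicator_.features_

# Original labels first, then one "<label>_ind" name per flagged column
names = [str(c) for c in df.columns] + [f"{df.columns[i]}_ind" for i in flagged]
result = pd.DataFrame(values, columns=names, index=df.index)
print(result)
```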
Another nuance to note when working with SimpleImputer is that columns that contain only missing values are dropped by default –
df = pd.DataFrame(
[[7, 2, np.nan, np.nan], [4, np.nan, 6, np.nan],
[10, 5, 9, np.nan]])
mean_imp = SimpleImputer(strategy='mean')
pd.DataFrame(mean_imp.fit_transform(df))
0 1 2
0 7.0 2.0 7.5
1 4.0 3.5 6.0
2 10.0 5.0 9.0
This behavior can be changed by setting keep_empty_features=True. While it is manageable, tracing columns might be challenging –
mean_imp = SimpleImputer(
strategy='mean',
keep_empty_features=True,
add_indicator=True)
pd.DataFrame(mean_imp.fit_transform(df))
0 1 2 3 4 5 6
0 7.0 2.0 7.5 0.0 0.0 1.0 1.0
1 4.0 3.5 6.0 0.0 1.0 0.0 1.0
2 10.0 5.0 9.0 0.0 0.0 0.0 1.0
There is an elegant way to achieve similar behavior in pandas –
df = pd.DataFrame(
[[7, 2, np.nan, np.nan], [4, np.nan, 6, np.nan],
[10, 5, 9, np.nan]])
pd.concat(
[df.fillna(df.mean()),
df.isnull().astype(int).add_suffix("_ind")], axis=1)
0 1 2 3 0_ind 1_ind 2_ind 3_ind
0 7 2.0 7.5 NaN 0 0 1 1
1 4 3.5 6.0 NaN 0 1 0 1
2 10 5.0 9.0 NaN 0 0 0 1
Working with dates
Want to work with dates and fill several columns with different types? No problem with pandas –
from datetime import datetime
df = pd.DataFrame(
{"date": [
datetime(2023, 6, 20), np.nan,
datetime(2023, 6, 18), datetime(2023, 6, 16)],
"values": [np.nan, 1, 3, np.nan]})
df.fillna(df.mean())
Before –
date values
0 2023-06-20 NaN
1 NaT 1.0
2 2023-06-18 3.0
3 2023-06-16 NaN
After –
date values
0 2023-06-20 2.0
1 2023-06-18 1.0
2 2023-06-18 3.0
3 2023-06-16 2.0
Working with dates is an advantage that fillna has over SimpleImputer.
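How DataFrame.mean() treats datetime columns has differed between pandas versions, so the one-liner above may not behave the same everywhere. A more explicit variant, sketched below, computes the fill value per column and passes a dictionary to fillna. I use pd.Timestamp and pd.NaT here instead of datetime and np.nan; that is my substitution, not part of the original example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "date": [pd.Timestamp(2023, 6, 20), pd.NaT,
             pd.Timestamp(2023, 6, 18), pd.Timestamp(2023, 6, 16)],
    "values": [np.nan, 1, 3, np.nan]})

# Compute each column's fill value separately, then fill by column name
filled = df.fillna({
    "date": df["date"].mean(),      # mean of the three known dates
    "values": df["values"].mean(),  # mean of 1 and 3
})
print(filled)
```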
Backward and forward filling
So far, we treated the records as independent of their order. That is, we could have shuffled the records and that would not affect the imputed value. However, there are cases, for example time series, where the order matters and we would like to impute based on later values (backfill) or earlier values (forward fill). This is done by setting the method parameter.
df = pd.DataFrame(
[[7, 2, np.nan], [4, np.nan, 6],
[10, np.nan, 9], [np.nan, 5, 10]])
df.fillna(method='bfill')
0 1 2
0 7.0 2.0 6.0
1 4.0 5.0 6.0
2 10.0 5.0 9.0
3 NaN 5.0 10.0
One can also limit the number of consecutive values which are imputed –
df.fillna(method='bfill', limit=1)
0 1 2
0 7.0 2.0 6.0
1 4.0 NaN 6.0
2 10.0 5.0 9.0
3 NaN 5.0 10.0
Note that when using bfill or ffill, and even more so when setting limit to a value other than None, it is possible that not all the values will be imputed.
For me, that’s a killer feature of fillna compared to SimpleImputer.
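A version note: to my knowledge, pandas 2.1 and later emit a deprecation warning for the method argument of fillna and point to the dedicated DataFrame.bfill and DataFrame.ffill methods instead. The sketch below repeats the calls above with those methods; the variable names are mine.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[7, 2, np.nan], [4, np.nan, 6],
     [10, np.nan, 9], [np.nan, 5, 10]])

# Backward fill: each gap takes the next valid value below it
back = df.bfill()

# Same, but fill at most one consecutive missing value per gap
back_limited = df.bfill(limit=1)

# Forward fill: each gap takes the last valid value above it
forward = df.ffill()

print(back)
print(back_limited)
print(forward)
```

With bfill the last row of column 0 stays missing, and with ffill the first row of column 2 stays missing, since there is no later or earlier value to copy from.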
Treat Infinite values as na
Setting pd.options.mode.use_inf_as_na = True will treat infinite values (i.e., np.inf and -np.inf) as missing values, for example –
df = pd.DataFrame([1, 2, np.inf, np.nan])
df.fillna(1000)
pd.options.mode.use_inf_as_na = False
0
0 1.0
1 2.0
2 inf
3 1000.0
pd.options.mode.use_inf_as_na = True
0
0 1.0
1 2.0
2 1000.0
3 1000.0
Note that inf and na are not treated the same for other use cases, e.g. df[0].value_counts(dropna=False), which still counts them separately –
0
1.0 1
2.0 1
NaN 1
NaN 1
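As far as I know, the use_inf_as_na option is deprecated in newer pandas releases (2.1 and later). An alternative that does not depend on a global option is to convert the infinite values to NaN explicitly before filling. A minimal sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([1, 2, np.inf, np.nan])

# Turn +inf and -inf into NaN first, then fill all missing values at once
filled = df.replace([np.inf, -np.inf], np.nan).fillna(1000)
print(filled)
```

Because the conversion is explicit, later operations such as value_counts see a single kind of missing value.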
Summary
Both pandas and scikit-learn offer basic functionality to deal with missing values. Assuming you are working with pandas DataFrames, fillna can achieve everything SimpleImputer can do and more – working with dates, backward and forward fill, etc. Additionally, there are some edge cases and specific behaviors to pay attention to when choosing which to use. For example, when using the bfill or ffill method, some values may not be imputed if they are the last or first ones, respectively.