Did you Miss me? PyCon IL 2023

Today I talked about working with missing data at PyCon IL. We started with a bit of theory about mechanisms of missing data –

MCAR – The fact that the data are missing is independent of the observed and unobserved data.
MAR – The fact that the data are missing is systematically related to the observed but not the unobserved data.
MNAR – The fact that the data are missing is systematically related to the unobserved data.

And deep-dived into an almost real-world example that utilizes the Python ecosystem – pandas, scikit-learn, and missingno.

My slides are available here and my code is here.

3 related posts I wrote about working with missing data in Python –

Pandas fillna vs scikit-learn SimpleImputer

Missing data is prevalent in real-world data and can be missing for various reasons. Gladly, both pandas and scikit-learn several imputation tools to deal with it. Pandas offers a basic yet powerful interface for univariate imputations using fillna and more advanced functionality using interpolate. scikit-learn offers both SimpleImputer for univariate imputations and KNNImputer and IterativeImputer for multivariate imputations. In this post, we will focus on fillna and SimpleImputer functionality and compare them.

Basic Functionality

SimpleImputer offers four strategies to fill in the nan values – mean, median, most_frequet, and constant.

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame(
    [[7, 2, np.nan], [4, np.nan, 6], [10, 5, 9]])
imp_mean = SimpleImputer(strategy='mean')
pd.DataFrame(imp_mean.fit_transform(df))

output –

      0    1    2
0   7.0  2.0  7.5
1   4.0  3.5  6.0
2  10.0  5.0  9.0

Can we achieve the same with pandas? Yes!

df.fillna(df.mean())

Want to impute with the most frequent value?

Asuume – df = pd.DataFrame(['a', 'a', 'b', np.nan])

With SimpleImputer –

imp_mode = SimpleImputer(
    strategy='most_frequent')
pd.DataFrame(
  
  imp_mode.fit_transform(df))

With fillna –

df.fillna(df.mode()[0])

And the output of both –

Different Strategies

Want to apply different strategies for different columns? using scikit-learn you will need several imputers, one per each strategy. Using fillna you can pass a dictionary, for example –

df = pd.DataFrame(
    [[7, 2, np.nan], [4, np.nan, 6], [10, 5, 9]])
df.fillna({1: 10000, 2: df[2].mean()})

    0        1    2
0   7      2.0  7.5
1   4  10000.0  6.0
2  10      5.0  9.0

Advanced Usage

Want to impute values drawn from a normal distribution, no brainer –

mean = 5
scale = 2
df = pd.DataFrame(
    [[7, 2, np.nan], [4, np.nan, 6], [10, 5, 9]])
df.fillna(
    pd.DataFrame(
        (np.random.normal(mean, scale, df.shape))

    0         1         2
0   7  2.000000  3.857513
1   4  5.407452  6.000000
2  10  5.000000  9.000000

Missing indicator

Using SimpleImputer, one can add indicator columns that obtain 1 if the original column was missing, and 0 otherwise. This can also be done using MissingIndicator

df = pd.DataFrame(
    [[7, 2, np.nan], [4, np.nan, 6], [10, 5, 9]])
mean_imp = SimpleImputer(strategy='mean', add_indicator=True)
mean_imp.fit_transform(df)
pd.DataFrame(mean_imp.fit_transform(df))

      0    1    2    3    4
0   7.0  2.0  7.5  0.0  1.0
1   4.0  3.5  6.0  1.0  0.0
2  10.0  5.0  9.0  0.0  0.0

Note that a missing column (i.e., columns 3 and 4 in the example above) corresponds only to columns with missing values. Therefore there is no missing indicator column corresponding to the column 0. If you are converting back and forth to pandas dataframes you should note this nuance.

Another nuance to note when working with SimpleImputer is that columns that contain only missing values are dropped by default –

df =  pd.DataFrame(
    [[7, 2, np.nan, np.nan], [4, np.nan, 6, np.nan],
    [10, 5, 9, np.nan]])
mean_imp = SimpleImputer(strategy='mean')
pd.DataFrame(mean_imp.fit_transform(df))

      0    1    2
0   7.0  2.0  7.5
1   4.0  3.5  6.0
2  10.0  5.0  9.0

This behavior is controllable using setting keep_empty_features=True. While it is manageable, tracing columns might be challenging –

mean_imp = SimpleImputer(
    strategy='mean',
    keep_empty_features=True,
    add_indicator=True)
pd.DataFrame(mean_imp.fit_transform(df))

      0    1    2    3    4    5    6
0   7.0  2.0  7.5  0.0  0.0  1.0  1.0
1   4.0  3.5  6.0  0.0  1.0  0.0  1.0
2  10.0  5.0  9.0  0.0  0.0  0.0  1.0

There is an elegant way to achieve similar behavior in pandas –

df = pd.DataFrame(
    [[7, 2, np.nan, np.nan], [4, np.nan, 6, np.nan],
     [10, 5, 9, np.nan]])
pd.concat(
    [df.fillna(df.mean()), 
     df.isnull().astype(int).add_suffix("_ind")], axis=1)

    0    1    2   3  0_ind  1_ind  2_ind  3_ind
0   7  2.0  7.5 NaN      0      0      1      1
1   4  3.5  6.0 NaN      0      1      0      1
2  10  5.0  9.0 NaN      0      0      0      1

Working with dates

Want to work with dates and fill several columns with different types? No problem with pandas –

df = pd.DataFrame(
    {"date": [
        datetime(2023, 6, 20), np.nan,
        datetime(2023, 6, 18), datetime(2023, 6, 16)],
     "values": [np.nan, 1, 3, np.nan]})
df.fillna(df.mean())

Before –

        date  values
0 2023-06-20     NaN
1        NaT     1.0
2 2023-06-18     3.0
3 2023-06-16     NaN

After –

        date  values
0 2023-06-20     2.0
1 2023-06-18     1.0
2 2023-06-18     3.0
3 2023-06-16     2.0

Working with dates is an advantage that fillna has over SimpleImputer.

Backward and forward filling

So far, we treated the records and their order as independent. That is, we could have shuffled the records and that would not affect the expected imputed value. However, there are cases, for example, when representing time series when the order matters and we would like to impute based on later values (backfill) or earlier values (forward fill). This is done by setting the method property.

df = pd.DataFrame(
    [[7, 2, np.nan], [4, np.nan, 6],
     [10, np.nan, 9], [np.nan, 5, 10]])
df.fillna(method='bfill')

      0    1     2
0   7.0  2.0   6.0
1   4.0  5.0   6.0
2  10.0  5.0   9.0
3   NaN  5.0  10.0

One can also limit the number of consecutive values which are imputed –

df.fillna(method='bfill', limit=1)

      0    1     2
0   7.0  2.0   6.0
1   4.0  NaN   6.0
2  10.0  5.0   9.0
3   NaN  5.0  10.0

Note that when using bfill or ffill and moreover, when specifying limit to value other than None it is possible that not all the values would be imputed.

For me, that’s a killer feature of fillna comparing to SimpleImputer

Treat Infinite values as na

Setting pd.options.mode.use_inf_as_na = True will treat infinite values (i.e. np.inf, np.INF, np.NINF) values as missing values, for example –

df = pd.DataFrame([1, 2, np.inf, np.nan])
df.fillna(1000)

pd.options.mode.use_inf_as_na = False

pd.options.mode.use_inf_as_na = True

Note that inf and na are not treated the same for other use cases, e.g. – df[0].value_counts(dropna=False)–

0
1.0    1
2.0    1
NaN    1
NaN    1

Summary

Both pandas and scikit-learn offer a basic functionality to deal with missing values. Assuming you are working with pandas Dataframe, pandas fillna functionality can achieve everything SimpleImputer can do and more – working with dates, back and forward fill, etc. Additionally, there are some edge cases and specific behaviors to pay attention to when choosing what to use. For example when using bfill or ffill method some values may not be imputed if there are the last ones or first ones respectively.

pandas read_csv and missing values

I read Domino Lab post about “Data Exploration with Pandas Profiler and D-Tale” where they load diagnostic mammograms used in the diagnostic of breast cancer from UCI website. Instead of missiing values the data contains ?. When reading the data using pandas read_csv function naively interpret the value as string value and change the column type to be object instead of float in this case.

In the post mentioned above the authors dealt with the issue in the following way –

masses = masses.replace('?', np.NAN)
masses.loc[:,names[:-1]] = masses.loc[:,names[:-1]].apply(pd.to_numeric)

That is, they first replaced the ? values in all the columns with np.NAN and then convert all the columns to numeric. Let’s call this method the manual method.

If we know the know the non default missing values in advance, can we do something better? The answer is yes!

See code here

Use na_values parameter

df = pd.read_csv(url, names=names, na_values=["?"])

na_values parameter can get scalar, string, list-like or dict parameters. If you pass a scalar, string or list-like parameter all columns are treated the same way. If you pass dict you can specify different set of NaN values per column.

The advantage of this method over the manual method is that you don’t need to convert the columns after replacing the nan values. In the manual method the column types are specified (in the given case they are all numeric), if there are multiple columns types you need to know it and specify it in advance.

Side note – likewise, for non trivial boolean values you can use true_values and false_values parameters.

Use converters parameter

df = pd.read_csv(url, names=names, converters={"BI-RADS": lambda x: x if x!="?" else np.NAN})

This is usually used to convert values of specific columns. If you would like to convert values in all the columns in the same way this is not the preferred method since you will have to add an entry for each column and if new column is added you won’t take care of it by default (this can be both advantage and disadvantage). However, for other use-cases, converters can help with more complex conversions.

Note that the result here is different then the result in the other methods since we only converted the values in one column.

Conclusion

Pandas provides several ways to deal with non-trivial missing values. If you know the non-trivial value in advance you are good to go and na_values is most likely the best way to go.

Performance wise (time) all methods perform roughly the same for the given dataset but that can change as a function on the dataset size (columns and rows), row types, number of non-trivial missing values.

On top of it, make reading documentation your superpower. It can use your tools smarter and more efficient and it can save you a lot of time.

See pandas read_csv documentation here

	Nicole S on 5 Python NLP pacakges
	blissful4bdd2399fa on CSV to radar plot
	tom on CSV to radar plot
	Matt on CSV to radar plot
	“ – Tom… on 📚 Book club Q1 2024 – 3…