Pandas fillna vs scikit-learn SimpleImputer

Missing data is prevalent in real-world datasets, and values can be missing for various reasons. Fortunately, both pandas and scikit-learn offer several imputation tools to deal with it. Pandas offers a basic yet powerful interface for univariate imputation using fillna, and more advanced functionality using interpolate. scikit-learn offers SimpleImputer for univariate imputation, and KNNImputer and IterativeImputer for multivariate imputation. In this post, we will focus on fillna and SimpleImputer and compare them.

Basic Functionality

SimpleImputer offers four strategies to fill in the nan values – mean, median, most_frequent, and constant.

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame(
    [[7, 2, np.nan], [4, np.nan, 6], [10, 5, 9]])
imp_mean = SimpleImputer(strategy='mean')
pd.DataFrame(imp_mean.fit_transform(df))

output –

      0    1    2
0   7.0  2.0  7.5
1   4.0  3.5  6.0
2  10.0  5.0  9.0

Can we achieve the same with pandas? Yes!

df.fillna(df.mean())

Want to impute with the most frequent value?

Assume – df = pd.DataFrame(['a', 'a', 'b', np.nan])

With SimpleImputer

imp_mode = SimpleImputer(
    strategy='most_frequent')
pd.DataFrame(
    imp_mode.fit_transform(df))

With fillna

# the first row of df.mode() holds the most frequent value of each column
df.fillna(df.mode().iloc[0])

And the output of both –

   0
0  a
1  a
2  b
3  a

Different Strategies

Want to apply different strategies to different columns? Using scikit-learn, you will need several imputers, one per strategy. Using fillna, you can pass a dictionary, for example –

df = pd.DataFrame(
    [[7, 2, np.nan], [4, np.nan, 6], [10, 5, 9]])
df.fillna({1: 10000, 2: df[2].mean()})
    0        1    2
0   7      2.0  7.5
1   4  10000.0  6.0
2  10      5.0  9.0
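For comparison, here is a minimal sketch of the scikit-learn side, assuming the imputers are combined with ColumnTransformer (one possible way to wire several imputers together, not the only one). Note that the output columns follow the order of the transformers, with the untouched columns last, so they need to be relabeled and reordered.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

df = pd.DataFrame([[7, 2, np.nan], [4, np.nan, 6], [10, 5, 9]])

# One imputer per strategy; integer selectors refer to column positions.
per_column = ColumnTransformer(
    transformers=[
        ("constant", SimpleImputer(strategy="constant", fill_value=10000), [1]),
        ("mean", SimpleImputer(strategy="mean"), [2]),
    ],
    remainder="passthrough",  # keep column 0 as-is
)
imputed = per_column.fit_transform(df)

# Output order is: column 1, column 2, then the passthrough column 0.
result = pd.DataFrame(imputed, columns=[1, 2, 0])[[0, 1, 2]]
```

The dictionary passed to fillna keeps the original column order and labels, which is what makes the pandas version shorter here.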

Advanced Usage

Want to impute values drawn from a normal distribution? No-brainer –

mean = 5
scale = 2
df = pd.DataFrame(
    [[7, 2, np.nan], [4, np.nan, 6], [10, 5, 9]])
df.fillna(
    pd.DataFrame(
        np.random.normal(mean, scale, df.shape)))
    0         1         2
0   7  2.000000  3.857513
1   4  5.407452  6.000000
2  10  5.000000  9.000000

Missing indicator

Using SimpleImputer, one can add indicator columns that take the value 1 if the original value was missing, and 0 otherwise. This can also be done using MissingIndicator.

df = pd.DataFrame(
    [[7, 2, np.nan], [4, np.nan, 6], [10, 5, 9]])
mean_imp = SimpleImputer(strategy='mean', add_indicator=True)
pd.DataFrame(mean_imp.fit_transform(df))
      0    1    2    3    4
0   7.0  2.0  7.5  0.0  1.0
1   4.0  3.5  6.0  1.0  0.0
2  10.0  5.0  9.0  0.0  0.0

Note that indicator columns (i.e., columns 3 and 4 in the example above) are added only for columns that contain missing values. Therefore, there is no indicator column corresponding to column 0. If you are converting back and forth to pandas DataFrames, you should keep this nuance in mind.
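One way to keep track of which indicator belongs to which column is to read the positions from the fitted imputer. The sketch below assumes the indicator_ attribute (a fitted MissingIndicator) and its features_ array; the "_ind" suffix is just a naming choice for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame([[7, 2, np.nan], [4, np.nan, 6], [10, 5, 9]])

imp = SimpleImputer(strategy="mean", add_indicator=True)
values = imp.fit_transform(df)

# Positions of the input columns that received an indicator column.
flagged = imp.indicator_.features_

# Rebuild readable labels: original columns first, then the indicators.
names = [str(c) for c in df.columns] + [f"{df.columns[i]}_ind" for i in flagged]
result = pd.DataFrame(values, columns=names, index=df.index)
```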

Another nuance to note when working with SimpleImputer is that columns that contain only missing values are dropped by default –

df =  pd.DataFrame(
    [[7, 2, np.nan, np.nan], [4, np.nan, 6, np.nan],
    [10, 5, 9, np.nan]])
mean_imp = SimpleImputer(strategy='mean')
pd.DataFrame(mean_imp.fit_transform(df))
      0    1    2
0   7.0  2.0  7.5
1   4.0  3.5  6.0
2  10.0  5.0  9.0

This behavior can be controlled by setting keep_empty_features=True. While it is manageable, tracing columns might be challenging –

mean_imp = SimpleImputer(
    strategy='mean',
    keep_empty_features=True,
    add_indicator=True)
pd.DataFrame(mean_imp.fit_transform(df))
      0    1    2    3    4    5    6
0   7.0  2.0  7.5  0.0  0.0  1.0  1.0
1   4.0  3.5  6.0  0.0  1.0  0.0  1.0
2  10.0  5.0  9.0  0.0  0.0  0.0  1.0

There is an elegant way to achieve similar behavior in pandas –

df = pd.DataFrame(
    [[7, 2, np.nan, np.nan], [4, np.nan, 6, np.nan],
     [10, 5, 9, np.nan]])
pd.concat(
    [df.fillna(df.mean()), 
     df.isnull().astype(int).add_suffix("_ind")], axis=1)
    0    1    2   3  0_ind  1_ind  2_ind  3_ind
0   7  2.0  7.5 NaN      0      0      1      1
1   4  3.5  6.0 NaN      0      1      0      1
2  10  5.0  9.0 NaN      0      0      0      1

Working with dates

Want to work with dates and fill several columns with different types? No problem with pandas –

from datetime import datetime

df = pd.DataFrame(
    {"date": [
        datetime(2023, 6, 20), np.nan,
        datetime(2023, 6, 18), datetime(2023, 6, 16)],
     "values": [np.nan, 1, 3, np.nan]})
df.fillna(df.mean())

Before –

        date  values
0 2023-06-20     NaN
1        NaT     1.0
2 2023-06-18     3.0
3 2023-06-16     NaN

After –

        date  values
0 2023-06-20     2.0
1 2023-06-18     1.0
2 2023-06-18     3.0
3 2023-06-16     2.0

Working with dates is an advantage that fillna has over SimpleImputer.

Backward and forward filling

So far, we treated the records as independent of their order. That is, we could have shuffled the records and that would not affect the imputed values. However, there are cases, such as time series, where the order matters and we would like to impute based on later values (backfill) or earlier values (forward fill). This is done by setting the method parameter.

df = pd.DataFrame(
    [[7, 2, np.nan], [4, np.nan, 6],
     [10, np.nan, 9], [np.nan, 5, 10]])
df.fillna(method='bfill')
      0    1     2
0   7.0  2.0   6.0
1   4.0  5.0   6.0
2  10.0  5.0   9.0
3   NaN  5.0  10.0

One can also limit the number of consecutive values which are imputed –

df.fillna(method='bfill', limit=1)
      0    1     2
0   7.0  2.0   6.0
1   4.0  NaN   6.0
2  10.0  5.0   9.0
3   NaN  5.0  10.0

Note that when using bfill or ffill, and even more so when setting limit to a value other than None, some values may remain unimputed.

For me, that’s a killer feature of fillna compared to SimpleImputer.
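As a side note, recent pandas releases deprecate the method argument of fillna in favor of the dedicated bfill and ffill methods. The sketch below repeats the example with those methods and, as one possible way to handle the leftovers, chains a forward fill after the backfill.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[7, 2, np.nan], [4, np.nan, 6],
     [10, np.nan, 9], [np.nan, 5, 10]])

# Same as fillna(method='bfill', limit=1): fill at most one consecutive gap.
back_filled = df.bfill(limit=1)

# Backfill first, then forward fill whatever is left at the end of a column.
fully_filled = df.bfill().ffill()
```

Here back_filled still has a gap in column 1, while fully_filled has no missing values because the trailing NaN in column 0 is taken from the row above it.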

Treat Infinite values as na

Setting pd.options.mode.use_inf_as_na = True makes pandas treat infinite values (i.e., np.inf and -np.inf) as missing values, for example –

df = pd.DataFrame([1, 2, np.inf, np.nan])
df.fillna(1000)

With pd.options.mode.use_inf_as_na = False –

        0
0     1.0
1     2.0
2     inf
3  1000.0

With pd.options.mode.use_inf_as_na = True –

        0
0     1.0
1     2.0
2  1000.0
3  1000.0

Note that inf and na are not treated the same for other use cases, e.g. – df[0].value_counts(dropna=False)

0
1.0    1
2.0    1
NaN    1
NaN    1
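Since this option is global (and, to my knowledge, deprecated in recent pandas releases), an explicit alternative is to convert infinite values to NaN yourself before filling. A minimal sketch –

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([1, 2, np.inf, np.nan])

# Turn +/- infinity into NaN explicitly, then fill as usual.
filled = df.replace([np.inf, -np.inf], np.nan).fillna(1000)
```

This affects only the DataFrame at hand and leaves the global pandas options untouched.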

Summary

Both pandas and scikit-learn offer basic functionality to deal with missing values. Assuming you are working with a pandas DataFrame, fillna can achieve everything SimpleImputer can do and more – working with dates, backward and forward fill, etc. Additionally, there are some edge cases and specific behaviors to pay attention to when choosing what to use. For example, when using the bfill or ffill method, some values may not be imputed if they are the last or first ones, respectively.

Few thoughts on Cloud FinOps Book

I just completed the book “Cloud FinOps” by J.R. Storment and Mike Fuller, and here are a few thoughts –

  1. At first, I wondered whether I should read the 1st edition, which I had easy access to, or the 2nd, which I had to buy. After reading a sample, I decided to buy the 2nd edition and am glad. This domain and community move quickly; a 2019 version would have been outdated and misleading.
  2. FinOps involves a paradigm shift – developers should consider not only the performance of their architecture (i.e., memory and CPU consumption, speed, etc.) but the cost associated with the resources they will use. Procurement is not done and approved by the finance team anymore. Developers’ decisions can have a significant influence on the cloud bill. FinOps teams bridge the engineering and finance teams (and more) and speak the language of all parties, along with additional skill sets and an overview of the entire organization. 
  3. A general rule of thumb regarding commitments –
    1. Longer commitment period (3 years → 1 year) = lower price (higher discount)
    2. More upfront (full upfront → partial upfront → no upfront) = lower price (higher discount)
    3. More specific (RI → Convertible RI → SP, region, etc.) = lower price (higher discount)
  4. The FinOps team should stay up to date on new cloud technologies and cost reduction options. I have been familiar with reserved and spot instances for a long time, but there are many other cost reduction details to pay attention to. For example, the following 2 points –
    1. When purchasing Savings Plans (SP), which are monetary commitments as opposed to resource-unit commitments, the spend amount you commit to is post-discount. Moreover, AWS applies the SP to the resources that yield the highest discount first. This implies that the effective discount rate diminishes as you commit to more money.
    2. CloudFront security savings bundle (here) is a saving plan that ties together the usage of CloudFront and WAF. The book predicts that such plans, e.g., combining multiple product usage, will become common soon.
  5. Commitments (e.g., SP, RI) are one of many ways to reduce costs. Removing idle resources (e.g., unattached drives), using correct storage classes (e.g., infrequent access, glacier), or making architecture changes (e.g., rightsizing, moving from servers to serverless, going via VPC endpoints, etc.) can help avoid and reduce cost. Those activities can happen in parallel – a centralized FinOps team manages commitments (aka cost reduction) while decentralized engineering teams optimize the resources they use (aka cost avoidance). Ideally, it is a tango: each team moves a small step at a time to optimize its part.
  6. The FinOps domain-specific knowledge goes even further. For example, there are costs that engineers tend to miss or estimate wrongly, e.g., network traffic cost, number of events, data storage events.
  7. The inform phase is part of the FinOps lifecycle – making the data available to the relevant participants. The Prius effect, i.e., real-time feedback, instantly influences behavior even without explicit recommendations or guidance. Visualizations (done right) can help understand and react to the data better. A point emphasized multiple times in the book – put the data in the path of the engineers or any other stakeholder. Don’t ask them to log in to a different system to review the data; integrate with existing systems they use regularly.
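To make the Savings Plans point from item 4 concrete, here is a small sketch with entirely made-up numbers: two hypothetical workloads, each costing 10 per hour on demand, one eligible for a 50% discount and the other for 20%. The function name and inputs are my own illustration, not something taken from the book or from AWS.

```python
def blended_discount(commitment, workloads):
    """Return the effective discount rate of an hourly commitment.

    workloads: list of (on_demand_cost_per_hour, discount_rate) tuples.
    All numbers are hypothetical and used for illustration only.
    """
    remaining = commitment
    on_demand_covered = 0.0
    # Apply the commitment to the highest-discount workloads first.
    for on_demand, rate in sorted(workloads, key=lambda w: w[1], reverse=True):
        discounted_cost = on_demand * (1 - rate)
        used = min(remaining, discounted_cost)
        on_demand_covered += used / (1 - rate)
        remaining -= used
    spent = commitment - remaining
    return 1 - spent / on_demand_covered if on_demand_covered else 0.0


workloads = [(10.0, 0.5), (10.0, 0.2)]
small = blended_discount(5.0, workloads)   # covers only the 50% workload
large = blended_discount(13.0, workloads)  # covers both workloads
```

With these numbers, the smaller commitment enjoys the full 50% discount, while the larger one blends down to roughly 35%, which is the diminishing return described above.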

A few resources I find helpful –

  1. FinOps foundation website – includes many resources and community knowledge – https://www.finops.org/introduction/what-is-finops/
  2. FinOps podcast – https://www.finops.org/community/finops-podcast/
  3. Infracost lets engineers see a cost breakdown and understand costs before making changes in the terminal, VS Code, or pull requests. https://www.infracost.io/
  4. Cloud Custodian – “Cloud Custodian is a tool that unifies the dozens of tools and scripts most organizations use for managing their public cloud accounts into one open source tool” – https://cloudcustodian.io/
  5. FinOut – A Holistic Cost Management Solution For Your Cloud. I recently participated in a demo and it looks super interesting. https://www.finout.io/
  6. Startup guide to data cost optimization – my post summarizing AWS’s ebook about data cost optimization for startups – https://tomron.net/2023/06/01/startup-guide-to-data-cost-optimization-summary/
  7. Twitter thread I wrote in Hebrew about the book – https://twitter.com/tomron696/status/1657686198327062529

Startup guide to data cost optimization – summary

I have been reading a lot about FinOps and cloud cost optimization these days, and I came across AWS’s short ebook about data cost optimization.

Cost optimization is part of AWS’s well-architected framework. When we think about cost optimization, we usually only consider computing resources, while there are significant optimizations that can go beyond that – storage optimization, network, etc.

Below is a summary of the six sections that appear in the ebook, with some comments –

Optimize the cost of information infrastructure – the main point in this section is to use Graviton instances where applicable.

Decouple storage data from compute data – 5 suggestions here which are pretty standard –

  1. Compress data when applicable, and use optimal data structures for your task.
  2. Consider data temperature when choosing data store and storage class – use the suitable S3 storage class and manage it using a life-cycle policy.
  3. Use low-cost compute resources, such as Spot Instances, when applicable – I have some dissonance here since I’m not sure that spot instances are attractive these days (see here), specifically with the overhead of taking care of preempted instances.
  4. Deploy compute close to data to reduce data transfer costs – trivial.
  5. Use Amazon S3 Select and Amazon S3 Glacier Select to reduce data retrieval – Amazon S3 Select has several limitations (see here), so I’m not sure it is worth the effort; it may be better to query via Athena.

Plan and provision capacity for predictable workload usage

  1. Choosing the right instance type based on workload pattern and growth – this is common sense. You’ll save a little less if you purchase convertible reserved instances. However, in a fast-changing startup environment, there is a higher chance the commitment won’t be underutilized.
  2. Deploying rightsizing based on average or medium workload usage – this contradicts best practices described in the Cloud FinOps book, so I’m a bit hesitant here.
  3. Using automatic scaling capabilities to meet peak demand – this is the most relevant advice in this section. Use auto-scaling groups or similar to balance performance and cost.

Access capacity on demand for unpredictable workloads

  1. Use Amazon Athena for ad hoc SQL workloads – as mentioned above, I prefer Athena over AWS S3 Select.
  2. Use AWS Glue instead of Amazon EMR for infrequent ETL jobs – I don’t have a strong opinion here, but if you have a data strategy in mind, I would try to align with it. Additionally, I feel that other AWS services can be even easier and more cost-effective to work with, for example, Apache Spark in Amazon Athena, Step Functions, etc.
  3. Use on-demand resources for transient workloads or short-term development and testing needs – having said that, you should still keep an eye on your production services, ensure they are utilized correctly and rightsize them if needed.

Avoid data duplication with a centralized storage layer

Implement a central storage layer to share data among tenants – I would shorten it to “have a data strategy” – where you are, where you want to go, etc., which is not trivial in early startup days.

Leverage up to $100,000 in AWS Activate credits

This might be a bit contradictory to the rest of the document, since it feels like free money and delays your concern about cloud costs.