5 tips for using Pandas

Recently, I worked closely with Pandas and found out a few things that are might common knowledge but were new to me and helped me write more efficient code in less time.


1. Don’t drop the na

Count the number of unique values including Na values.

Consider the following pandas DataFrame –

df = pd.DataFrame({"userId": list(range(5))*2 +[1, 2, 3],
                   "purchaseId": range(13),
                   "discountCode": [1, None]*5 + [2, 2, 2]})

Result

If I want to count the discount codes by type I might use –  df['discountCode'].value_counts() which yields – 

1.0    5
2.0    3

This will miss the purchases without discount codes. If I also care about those, I should do –

df['discountCode'].value_counts(dropna=False)

which yields –

NaN    5
1.0    5
2.0    3

This is also relevant for nuniqiue. For example, if I want to count the number of unique discount codes a user used – df.groupby("userId").agg(count=("discountCode", lambda x: x.nunique(dropna=False)))

See more here – https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.nunique.html

2. Margin on Row \ columns  only

 Following the above example, assume you want to know for each discount code which users used it and for each user which discount code she used. Additionally you want to know has many unique discount codes each user used and how many unique users used each code, you can use pivot table with margins argument –

df.pivot_table(index="userId", columns="discountCode",
               aggfunc="nunique", fill_value=0,
               margins=True)

Result –

It would be nice to have the option to get margins only for rows or only for columns. The dropna option does not act as expected – the na values are taken into account in the aggregation function but not added as a column or an index in the resulted Dataframe.

3. plotly backend


Pandas plotting capabilities is nice but you can go one step further and use plotly very easy by setting plotly as pandas plotting backend.  Just add the following line after importing pandas (no need to import plotly, you do need to install it) –

pd.options.plotting.backend = "plotly"

Note that plotly still don’t support all pandas plotting options (e.g subplots, hexbins) but I believe it will improve in the future. 


See more here – https://plotly.com/python/pandas-backend/


4. Categorical dtype and qcut

Categorical variables are common – e.g., gender, race, part of day, etc. They can be ordered (e.g part of day) or unordered (e.g gender). Using categorical data type one can validate data values better and compare them in case they are ordered (see user guide here). qcut allows us to customize binning for discrete and categorical data.

See documentation here and the post the caught my attention about it here – https://medium.com/datadriveninvestor/5-cool-advanced-pandas-techniques-for-data-scientists-c5a59ae0625d

5. tqdm integration


tqdm is a progress bar that wraps any Python iterable, you can also use to follow the progress of pandas apply functionality using progress_apply instead of apply (you need to initialize tqdm before by doing tqdm.pandas()).

See more here – https://github.com/tqdm/tqdm#pandas-integration

Advertisement

5 interesting things (04/12/2020)

How to set compensation using commonsense principles – yet another artwork by Erik Bernhardsson. I like his analytics approach and the way he models his ideas. His manifest regarding compensation systems (Good/bad compensation systems) is brilliant. I believe most of us agree with him while he put it into words. His modeling has some drawbacks that he is aware of. For example, assuming certainty in employee productivity, almost perfect knowledge of the market. Yet, it is totally worth your time.

https://erikbern.com/2020/06/08/how-to-set-compensation-using-commonsense-principles.html

7 Over Sampling techniques to handle Imbalanced Data – imbalanced data is a common real world scenario, specifically in healthcare where most of the patients don’t have a certain condition one is looking for. Over-sampling is a method to handle imbalanced data, this post describes several techniques to handle it. Interestingly, at least in this specific example, most of the techniques do not bring significant improvement. I would therefore compare several techniques and won’t just try one of them. 
https://towardsdatascience.com/7-over-sampling-techniques-to-handle-imbalanced-data-ec51c8db349f

This post uses a the following package which I didn’t know before (it would be great if it could become part of scikit-learn) – https://imbalanced-learn.readthedocs.io/en/stable/index.html

It would be nice to see a similar post fo downsampling techniques.


Python’s do’s and don’t do – very nicely and written with good examples – 
https://towardsdatascience.com/10-quick-and-clean-coding-hacks-in-python-1ccb16aa571b

Every Complex DataFrame Manipulation, Explained & Visualized Intuitively – can’t remember how pandas function work? great, you are not alone. You can use this guide to quickly remind you how melt, explode, pivot and others work.
https://medium.com/analytics-vidhya/every-dataframe-manipulation-explained-visualized-intuitively-dbeea7a5529e

Causal Inference that’s not A/B Testing: Theory & Practical Guide – Causality is often overlooked in the industry. Many times you developed a model that is “good enough” and move on. However, this might increase bias and lead to unfavourable results. This post suggests a hands-on approach to causality accompanied by code samples.

https://towardsdatascience.com/causal-inference-thats-not-a-b-testing-theory-practical-guide-f3c824ac9ed2