5 tips for using Pandas

Recently, I worked closely with Pandas and found out a few things that are might common knowledge but were new to me and helped me write more efficient code in less time.


1. Don’t drop the na

Count the number of unique values including Na values.

Consider the following pandas DataFrame –

df = pd.DataFrame({"userId": list(range(5))*2 +[1, 2, 3],
                   "purchaseId": range(13),
                   "discountCode": [1, None]*5 + [2, 2, 2]})

Result

If I want to count the discount codes by type I might use –  df['discountCode'].value_counts() which yields – 

1.0    5
2.0    3

This will miss the purchases without discount codes. If I also care about those, I should do –

df['discountCode'].value_counts(dropna=False)

which yields –

NaN    5
1.0    5
2.0    3

This is also relevant for nuniqiue. For example, if I want to count the number of unique discount codes a user used – df.groupby("userId").agg(count=("discountCode", lambda x: x.nunique(dropna=False)))

See more here – https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.nunique.html

2. Margin on Row \ columns  only

 Following the above example, assume you want to know for each discount code which users used it and for each user which discount code she used. Additionally you want to know has many unique discount codes each user used and how many unique users used each code, you can use pivot table with margins argument –

df.pivot_table(index="userId", columns="discountCode",
               aggfunc="nunique", fill_value=0,
               margins=True)

Result –

It would be nice to have the option to get margins only for rows or only for columns. The dropna option does not act as expected – the na values are taken into account in the aggregation function but not added as a column or an index in the resulted Dataframe.

3. plotly backend


Pandas plotting capabilities is nice but you can go one step further and use plotly very easy by setting plotly as pandas plotting backend.  Just add the following line after importing pandas (no need to import plotly, you do need to install it) –

pd.options.plotting.backend = "plotly"

Note that plotly still don’t support all pandas plotting options (e.g subplots, hexbins) but I believe it will improve in the future. 


See more here – https://plotly.com/python/pandas-backend/


4. Categorical dtype and qcut

Categorical variables are common – e.g., gender, race, part of day, etc. They can be ordered (e.g part of day) or unordered (e.g gender). Using categorical data type one can validate data values better and compare them in case they are ordered (see user guide here). qcut allows us to customize binning for discrete and categorical data.

See documentation here and the post the caught my attention about it here – https://medium.com/datadriveninvestor/5-cool-advanced-pandas-techniques-for-data-scientists-c5a59ae0625d

5. tqdm integration


tqdm is a progress bar that wraps any Python iterable, you can also use to follow the progress of pandas apply functionality using progress_apply instead of apply (you need to initialize tqdm before by doing tqdm.pandas()).

See more here – https://github.com/tqdm/tqdm#pandas-integration

Advertisement

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s