Exploratory Data Analysis Course – Draft

Last week I gave an extended version of my talk about box plots in Noa Cohen's Introduction to Data Science class at Azrieli College of Engineering Jerusalem. Slides can be found here.

The students are 3rd and 4th-year students, and some will become data scientists and analysts. Their questions and comments, together with my experience with junior data analysts, made me realize that a big gap they face in landing those positions and performing well in them is doing EDA – exploratory data analysis. This reminded me of the missing semester of your CS education – skills that are needed, and sometimes perceived as common knowledge in the industry, but are not taught or talked about in academia.

“Exploratory Data Analysis (EDA) is the crucial process of using summary statistics and graphical representations to perform preliminary investigations on data in order to uncover patterns, detect anomalies, test hypotheses, and verify assumptions.” (see more here). EDA plays an important role in the everyday life of anyone working with data – data scientists, analysts, and data engineers. It is often also relevant for managers and developers, helping them solve the issues they face better and more efficiently and communicate their work and findings.

I started turning over in my head what an EDA course would look like –

Module 1 – Back to basics (3 weeks)

  1. Data types of variables, types of data
  2. Basic statistics and probability, correlation
  3. Anscombe’s quartet
  4. Hands-on lab – Python basics (pandas, numpy, etc.)

Module 2 – Data visualization (3 weeks)

  1. Basic data visualizations and when to use them – pie chart, bar charts, etc.
  2. Theory of graphical representation (e.g. Grammar of Graphics or something more up-to-date about human perception)
  3. Beautiful lies – graphical caveats (e.g. box plot)
  4. Hands-on lab – Python data visualization packages (matplotlib, plotly, etc.)

Module 3 – Working with non-tabular data (4 weeks)

  1. Data exploration on textual data
  2. Time series – anomaly detection
  3. Data exploration on images

Module 4 – Missing data (2 weeks)

  1. Missing data patterns
  2. Imputations
  3. Hands-on lab – a combination of missing data / non-tabular data

Extras if time allows-

  1. Working with unbalanced data
  2. Algorithmic fairness and biases
  3. Data exploration on graph data

I’m very open to exploring and discussing this topic more. Feel free to reach out – Twitter / LinkedIn


5 interesting things (03/11/2022)

How to communicate effectively as a developer – writing effectively is the second most important skill after reading effectively, and it is one of the skills that can differentiate you and push you forward. If you read only one thing today, read this –

https://www.karlsutt.com/articles/communicating-effectively-as-a-developer/

26 AWS Security Best Practices to Adopt in Production – this is a periodic reminder to pay attention to our SecOps. This post is very well written and the initial table of AWS security best practices by service is great. 

https://sysdig.com/blog/26-aws-security-best-practices/

EVA Video Analytics System – “EVA is a new database system tailored for video analytics — think MySQL for videos.” Looks cool at first glance, and I can think of use cases for myself, yet I wonder if it could become production-grade.

https://github.com/georgia-tech-db/eva

I see it as somehow complementary to – https://github.com/impira/docquery

Forestplot – “This package makes publication-ready forest plots easy to make out-of-the-box.” I like it when academia and technology meet, and this is really usable, also for data scientists’ day-to-day work. The next step would probably be deeper integration with scikit-learn and pandas.

https://github.com/lsys/forestplot

Bonus – Python DataViz cookbook – an easy way to navigate between the different common Python visualization practices (i.e. via pandas vs. using matplotlib / plotly / seaborn). I would like to see it go to the next step – controlling the colors, grid, etc. from the UI and then switching between the frameworks – but it is a good starting point.

https://dataviz.dylancastillo.co/

roadmap.sh – it is not always clear how to level up your skills and what you should learn next (best practices, which technology, etc.). Roadmap.sh attempts to create such roadmaps. While I don’t agree with everything there, I think the format and references are nice and it is a good source of inspiration.

https://roadmap.sh/

Shameless plug – Growing A Python Developer (2021); I plan to write a small update in the near future.

Think outside of the Box Plot

Earlier today, I spoke at the DataTLV conference about box plots – what they expose, what they hide, and how they mislead. My slides can be found here, and the code used to generate the plots is here.

Key takeaways

  • Box plots show 5-number summary statistics – minimum, maximum, median, Q1, and Q3.
  • The flaws of box plots can be divided into two groups – data that is not present in the visualization (e.g. number of samples, distribution) and the visualization being counter-intuitive (e.g. quartiles are a hard concept to grasp).
  • I chose solutions that are easy to implement, either by leveraging existing packages’ code or by adding small tweaks. I used plotly (a minimal sketch follows this list).
  • Aside from those adjustments, many times a box plot is just not the right graph for the job.
  • If the statistical literacy of your audience is not well established, I would avoid using box plots.
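
To give a flavor of what I mean by small tweaks, here is a minimal sketch (my own illustration, not the exact code from the talk) using plotly express and one of its bundled demo datasets; drawing the underlying points next to the box exposes the sample size and distribution that a plain box plot hides –

import plotly.express as px

tips = px.data.tips()  # demo dataset bundled with plotly
# points="all" draws every observation next to the box,
# revealing sample sizes and distribution shape
fig = px.box(tips, x="day", y="total_bill", points="all")
fig.show()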

Topics I didn’t talk about and worth mentioning

  • Mary Eleanor Hunt Spear – a data visualization specialist who pioneered the development of the bar chart and box plot. I had a slide about her but went too fast and skipped it. See here.
  • How percentiles are calculated – several methods exist, and different Python packages use different default methods. Read more – http://jse.amstat.org/v14n3/langford.html

Resources I used to prepare the talk

5 interesting things (1/12/21)

Tests aren’t enough: Case study after adding type hints to urllib3 – I read these posts like thrillers (and some of them are about the same length :). This post describes the effort of adding type hints to urllib3 and what the maintainers learned during the process. Super interesting.

https://sethmlarson.dev/blog/2021-10-18/tests-arent-enough-case-study-after-adding-types-to-urllib3

Why you shouldn’t invoke setup.py directly – a post by Paul Ganssle, a Python core developer and setuptools maintainer. TL;DR of the TL;DR in the post – “The setuptools team no longer wants to be in the business of providing a command-line interface and is actively working to become just a library for building packages”. See the table in the summary section for a quick how-to guide.

https://blog.ganssle.io/articles/2021/10/setup-py-deprecated.html

Python pathlib Cookbook: 57+ Examples to Master It (2021) – I have had a post about pathlib in my drafts for a while, and now I can delete it since this guide is much more extensive. In short, pathlib has been part of the Python standard library since Python 3.4, and it provides an abstraction over filesystem paths across operating systems. If you still use os for paths, this is a good time to switch.

https://miguendes.me/python-pathlib
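
To give a taste of the switch, a minimal sketch (with a made-up directory layout, not taken from the guide) –

from pathlib import Path

data_dir = Path("data")                   # hypothetical directory
for csv_file in data_dir.glob("*.csv"):   # iterate over CSV files
    print(csv_file.stem, csv_file.suffix)
report = data_dir / "out" / "report.txt"  # "/" joins path parts
report.parent.mkdir(parents=True, exist_ok=True)
report.write_text("done")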

10 Tips For Using OKRs Effectively – I think a lot about OKRs for my team, and moreover about personal OKRs, and how to grow both the team and the product. I found this post (and the associated links) insightful.

https://rushabhdoshi.com/posts/2020-06-18-10-tips-for-making-okrs-effective/

How to Choose the Right Structure for Your Data Team – I started with these posts and soon enough read many more by Barr Moses, Co-Founder and CEO of Monte Carlo. Her posts have two dimensions that are relevant for me – team building (specifically around data-intensive products) and data engineering. If you find at least one of those topics interesting, I believe you’ll enjoy her posts.

https://towardsdatascience.com/how-to-choose-the-right-structure-for-your-data-team-be6c1b66a067
https://towardsdatascience.com/7-questions-to-ask-when-building-your-data-team-at-a-hypergrowth-company-dce0c0f343b4

5 interesting things (21/10/21)

4 Things Tutorials Don’t Tell You About PyPI – this hands-on experience, together with the explanations, is priceless. Even if you don’t plan to upload a package to PyPI anytime soon, these glimpses of how PyPI works are interesting.

https://blog.paoloamoroso.com/2021/09/4-things-tutorials-dont-tell-you-about.html

Responsible Tech Playbook – I totally agree with Martin Fowler’s statement that “Whether asked to or not, we have a duty to ensure our systems don’t degrade our society”. This post promotes the playbook about responsible tech published by Fowler and his colleagues from Thoughtworks. It also references additional resources such as the Tarot Cards of Tech and Ethical Explorer.

https://martinfowler.com/articles/2021-responsible-tech-playbook.html

A Perfect Match – A Python 3.10 Brain Teaser – Python 3.10 was released earlier this month, and the most talked-about feature is structural pattern matching. Read this post to make sure you get it right.

https://medium.com/pragmatic-programmers/a-perfect-match-ef552dd1c1b1
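
For reference, a minimal illustration of the syntax (my own toy example, not the brain teaser from the post) –

def describe(point):
    match point:
        case (0, 0):
            return "origin"
        case (0, y):
            return f"on the y axis at {y}"
        case (x, 0):
            return f"on the x axis at {x}"
        case _:
            return "somewhere else"

print(describe((0, 3)))  # on the y axis at 3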

How I got my career back on track – a career is not a miracle; it doesn’t just happen. It is totally ok if you don’t want one, but if you do and have aspirations, you have to own it and manage your way there.

https://rinaarts.com/how-i-got-my-career-back-on-track

PyCatFlow – a big part of today’s data is time series data combined with categorical data, e.g. changes in the mix of medical diagnoses or shopping categories over time. PyCatFlow is a visualization tool that allows the representation of temporal developments based on categorical data. Check out their Jupyter Notebook with interactive widgets that can be run online.

https://medium.com/@bumatic/pycatflow-visualizing-categorical-data-over-time-b344102bcce2

Growing A Python Developer (2021)

I recently ran into a question from a team lead about how to grow a backend Python developer on her team. Since I have also iterated on this topic with my own team, I already had a few ideas in mind.

A few disclaimers before we start. First, I believe that the developer also has a share in the process and should express her interests and aspirations. The team lead or tech lead can give direction and shine a light on blind spots, but does not hold all the responsibility. It is also ok to dive into an idea or a tool that is not required at the moment; it might come in handy in the future and can inspire you. Second, my view is limited to the areas I work in. Different organizations or products have different needs and focus. Third, build habits to constantly learn and grow – read blogs and books, listen to podcasts, take online or offline courses, watch videos – whatever works for you, as long as you keep moving.

Consider the links below as appetizers. Each subject has many additional resources besides the ones I posted. Most likely I’m just not familiar with them, so please feel free to add them and I’ll update the post. Some subjects are so broad and product-dependent (e.g. cloud) that I didn’t add links at all. Additionally, when using a specific product / service / package, read the documentation and make it your superpower. Know the Python standard library well (e.g. itertools, functools, collections, pathlib, etc.); it can save you a lot of time, effort, and bugs.
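
A tiny taste of why the standard library pays off – a toy sketch of mine, not tied to any specific product –

from collections import Counter
from itertools import groupby

words = ["spam", "eggs", "spam", "ham", "spam"]
print(Counter(words).most_common(1))  # [('spam', 3)]

# groupby expects its input to be sorted by the grouping key
for key, group in groupby(sorted(words)):
    print(key, len(list(group)))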

General ideas and concepts

  1. Clean code – book, book summary
  2. Design patterns – refactoring book, refactoring guru, python design patterns GitHub repo
  3. Distributed design patterns – Patterns of Distributed Systems
  4. SOLID principles – SOLID coding in Python
  5. Cloud
  6. Deployment – CI/CD, Docker, Kubernetes
  7. Version control – git guide
  8. Databases – Using Databases with Python, databases tutorials
  9. Secure Development – Python cheat sheet by Snyk, OWASP

Python specific

  1. Webservices – flask, Django, FastAPI
  2. Testing – Unit Testing in Python — The Basics
  3. Packaging –  Python Packaging User Guide
  4. Data analysis – pandas, NumPy, scikit-learn
  5. Visualization – plotly, matplotlib
  6. Concurrency – Speed Up Your Python Program With Concurrency
  7. Debugging – debugging with PDB, Python debugging in VS Code
  8. Dependency management – Comparison of Pip, Pipenv and Poetry dependency management tools
  9. Type annotation – Type Annotations in Python
  10. Python 3.10 – What’s New in Python 3.10?, Why you can’t switch to Python 3.10 just yet

Additional resources

  1. Podcast.__init__ – The weekly podcast about Python and its use in machine learning and data science.
  2. The real python podcast
  3. Top 8 Python Podcasts You Should Be Listening to
  4. Python 3 module of the week
  5. Lazy programmer – courses on Udemy mainly AI and ML using Python
  6. cloudonaut – podcast and blog about AWS

5 interesting things (22/09/21)

Writing a Great CV for Your First Technical Role – a series of 3 parts about best practices, mistakes, and pitfalls in CVs, showing both good and bad examples. I find the posts relevant not just for first roles but also as a good reminder when updating your CV.

https://naomikriger.medium.com/writing-a-great-cv-for-your-first-technical-role-part-1-75ffc372e54e

Patterns in confusing explanations – writing, and technical writing in particular, is a superpower. Being able to communicate your ideas clearly so that others can engage with them is a very impactful skill. In this post, Julia Evans describes 13 patterns of bad explanations and accompanies them with positive examples.

https://jvns.ca/blog/confusing-explanations/

How We Design Our APIs at Slack – not only do I agree with this advice, I have also had some bad experiences with similar issues, both as an API supplier and as a consumer. Many times when big companies describe their architecture and processes, they are irrelevant to small companies due to cost, lack of data or resources, or other reasons, but the great thing about this post is that it also fits small companies and is relatively easy to implement.

https://slack.engineering/how-we-design-our-apis-at-slack/

Python Anti-Pattern – this post describes a bug that sits at the intersection of Python and AWS Lambda functions. One could say that it is an extreme case, but I tend to think it is more common than one would expect, and you may spend hours debugging it. It is well written and very important to know if you are using Lambda functions.

https://valinsky.me/articles/python-anti-pattern/

Architectural Decision Records – sharing knowledge is hard. Sometimes what is clear to you is not clear to others, sometimes it is not taken into account in the time estimation or takes longer than expected, and other times you just want to move on and deliver. Having templates and conventions makes it easier both for the writers and the readers. ADRs answer a specific need.

https://adr.github.io/

pandas read_csv and missing values

I read the Domino Lab post about “Data Exploration with Pandas Profiler and D-Tale”, where they load diagnostic mammogram data used in the diagnosis of breast cancer from the UCI website. Instead of missing values, the data contains ?. When reading the data with pandas' read_csv function, those values are naively interpreted as strings, and the column type becomes object instead of float in this case.

In the post mentioned above the authors dealt with the issue in the following way –

masses = masses.replace('?', np.NAN)
masses.loc[:,names[:-1]] = masses.loc[:,names[:-1]].apply(pd.to_numeric)

That is, they first replaced the ? values in all the columns with np.NAN and then converted all the columns to numeric. Let’s call this method the manual method.

If we know the non-default missing values in advance, can we do something better? The answer is yes!

See code here

Use na_values parameter

df = pd.read_csv(url, names=names, na_values=["?"])

The na_values parameter accepts scalar, string, list-like, or dict arguments. If you pass a scalar, a string, or a list-like argument, all columns are treated the same way. If you pass a dict, you can specify a different set of NaN values per column.
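
For example, a hypothetical dict form (the column names here are just for illustration; only "BI-RADS" also appears in the converters example below) –

# "?" is treated as NaN only in the "BI-RADS" column,
# while "unknown" is treated as NaN only in a hypothetical "Shape" column
df = pd.read_csv(url, names=names,
                 na_values={"BI-RADS": ["?"], "Shape": ["unknown"]})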

The advantage of this method over the manual method is that you don’t need to convert the columns after replacing the NaN values. In the manual method the column types must be specified after the replacement (in the given case they are all numeric); if there are multiple column types you need to know them and specify them in advance.

Side note – likewise, for non-trivial boolean values you can use the true_values and false_values parameters.

Use converters parameter

df = pd.read_csv(url, names=names, converters={"BI-RADS": lambda x: x if x!="?" else np.NAN})

This is usually used to convert values of specific columns. If you would like to convert values in all the columns the same way, this is not the preferred method, since you would have to add an entry for each column, and if a new column is added it won't be handled by default (this can be both an advantage and a disadvantage). However, for other use cases, converters can help with more complex conversions.

Note that the result here is different from the result of the other methods since we only converted the values in one column.

Conclusion

Pandas provides several ways to deal with non-trivial missing values. If you know the non-trivial values in advance, na_values is most likely the best way to go.

Performance-wise (time), all methods perform roughly the same for the given dataset, but that can change as a function of the dataset size (columns and rows), column types, and the number of non-trivial missing values.

On top of that, make reading documentation your superpower. It will help you use your tools in a smarter and more efficient way, and it can save you a lot of time.

See pandas read_csv documentation here

5 tips for using Pandas

Recently, I worked closely with pandas and found a few things that might be common knowledge but were new to me and helped me write more efficient code in less time.


1. Don’t drop the na

Count the number of unique values including NaN values.

Consider the following pandas DataFrame –

import pandas as pd

df = pd.DataFrame({"userId": list(range(5))*2 + [1, 2, 3],
                   "purchaseId": range(13),
                   "discountCode": [1, None]*5 + [2, 2, 2]})


If I want to count the discount codes by type, I might use df['discountCode'].value_counts(), which yields –

1.0    5
2.0    3

This will miss the purchases without discount codes. If I also care about those, I should do –

df['discountCode'].value_counts(dropna=False)

which yields –

NaN    5
1.0    5
2.0    3

This is also relevant for nunique. For example, if I want to count the number of unique discount codes each user used – df.groupby("userId").agg(count=("discountCode", lambda x: x.nunique(dropna=False)))

See more here – https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.nunique.html

2. Margins on rows / columns only

Following the above example, assume you want to know for each discount code which users used it, and for each user which discount codes she used. Additionally, you want to know how many unique discount codes each user used and how many unique users used each code. You can use a pivot table with the margins argument –

df.pivot_table(index="userId", columns="discountCode",
               aggfunc="nunique", fill_value=0,
               margins=True)


It would be nice to have the option to get margins only for rows or only for columns. Also, the dropna option does not act as expected – the NaN values are taken into account in the aggregation function but are not added as a column or an index in the resulting DataFrame.
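
A possible workaround (my own sketch, not a pandas feature) is to build the pivot table without margins and append only the margin you want manually –

pivot = df.pivot_table(index="userId", columns="discountCode",
                       values="purchaseId", aggfunc="nunique", fill_value=0)
# append only a column-wise "All" row: unique purchases per discount code
pivot.loc["All"] = df.groupby("discountCode")["purchaseId"].nunique()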

3. plotly backend


Pandas' plotting capabilities are nice, but you can go one step further and use plotly very easily by setting plotly as the pandas plotting backend. Just add the following line after importing pandas (no need to import plotly, but you do need to install it) –

pd.options.plotting.backend = "plotly"
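
With the backend set, the familiar pandas plot calls return plotly figures – a minimal sketch with a made-up DataFrame –

df = pd.DataFrame({"day": ["Mon", "Tue", "Wed"], "sales": [3, 7, 5]})
fig = df.plot(x="day", y="sales", kind="bar")  # returns a plotly Figure
fig.show()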

Note that plotly still doesn't support all pandas plotting options (e.g. subplots, hexbins), but I believe it will improve in the future.


See more here – https://plotly.com/python/pandas-backend/


4. Categorical dtype and qcut

Categorical variables are common – e.g., gender, race, part of day, etc. They can be ordered (e.g. part of day) or unordered (e.g. gender). Using the categorical data type, one can validate data values better and compare them when they are ordered (see the user guide here). qcut allows us to customize binning for discrete and categorical data.

See the documentation here, and the post that caught my attention about it here – https://medium.com/datadriveninvestor/5-cool-advanced-pandas-techniques-for-data-scientists-c5a59ae0625d
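
A minimal sketch of both ideas, with toy data of my own –

import pandas as pd

# ordered categorical – "morning" < "afternoon" < "evening"
s = pd.Series(pd.Categorical(["morning", "evening", "afternoon", "morning"],
                             categories=["morning", "afternoon", "evening"],
                             ordered=True))
print(s.min(), s.max())  # morning evening

# qcut – bin a numeric column into quartiles with readable labels
ages = pd.Series([23, 35, 47, 52, 61, 70, 18, 44])
print(pd.qcut(ages, q=4, labels=["q1", "q2", "q3", "q4"]))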

5. tqdm integration


tqdm is a progress bar that wraps any Python iterable. You can also use it to follow the progress of pandas' apply functionality by using progress_apply instead of apply (you need to initialize tqdm first by calling tqdm.pandas()).
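
In code, a minimal sketch (with a made-up DataFrame) –

import pandas as pd
from tqdm import tqdm

tqdm.pandas()  # registers progress_apply / progress_map on pandas objects

df = pd.DataFrame({"value": range(10_000)})
df["squared"] = df["value"].progress_apply(lambda x: x ** 2)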

See more here – https://github.com/tqdm/tqdm#pandas-integration

5 interesting things (04/12/2020)

How to set compensation using commonsense principles – yet another piece of art by Erik Bernhardsson. I like his analytical approach and the way he models his ideas. His manifesto regarding compensation systems (Good/bad compensation systems) is brilliant; I believe most of us agree with him, but he put it into words. His modeling has some drawbacks that he is aware of, for example assuming certainty in employee productivity and almost perfect knowledge of the market. Yet, it is totally worth your time.

https://erikbern.com/2020/06/08/how-to-set-compensation-using-commonsense-principles.html

7 Over Sampling techniques to handle Imbalanced Data – imbalanced data is a common real-world scenario, specifically in healthcare, where most patients don't have the condition one is looking for. Over-sampling is a way to handle imbalanced data, and this post describes several techniques for doing so. Interestingly, at least in this specific example, most of the techniques do not bring significant improvement. I would therefore compare several techniques and not just try one of them.
https://towardsdatascience.com/7-over-sampling-techniques-to-handle-imbalanced-data-ec51c8db349f

This post uses the following package, which I didn't know before (it would be great if it could become part of scikit-learn) – https://imbalanced-learn.readthedocs.io/en/stable/index.html

It would be nice to see a similar post for downsampling techniques.
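
For a quick feel of the package, a minimal SMOTE sketch on synthetic data (my own example, not taken from the post) –

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# a synthetic imbalanced dataset – roughly 95% negatives, 5% positives
X, y = make_classification(n_samples=1000, weights=[0.95], random_state=42)
print(Counter(y))

X_resampled, y_resampled = SMOTE(random_state=42).fit_resample(X, y)
print(Counter(y_resampled))  # the classes are now balanced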


Python’s do’s and don’ts – very nicely written, with good examples –
https://towardsdatascience.com/10-quick-and-clean-coding-hacks-in-python-1ccb16aa571b

Every Complex DataFrame Manipulation, Explained & Visualized Intuitively – can’t remember how pandas functions work? Great, you are not alone. You can use this guide to quickly remind yourself how melt, explode, pivot, and others work.
https://medium.com/analytics-vidhya/every-dataframe-manipulation-explained-visualized-intuitively-dbeea7a5529e

Causal Inference that’s not A/B Testing: Theory & Practical Guide – causality is often overlooked in the industry. Many times you develop a model that is “good enough” and move on. However, this might increase bias and lead to unfavourable results. This post suggests a hands-on approach to causality, accompanied by code samples.

https://towardsdatascience.com/causal-inference-thats-not-a-b-testing-theory-practical-guide-f3c824ac9ed2