Exploratory Data Analysis Course – Draft

Last week I gave an extended version of my talk about box plots in Noa Cohen‘s Introduction to Data Science class at Azrieli College of Engineering Jerusalem. Slides can be found here.

The students are 3rd and 4th-year students, and some will become data scientists and analysts. Their questions and comments, together with my experience with junior data analysts, made me realize that a big gap they have in getting those positions and performing well in them is doing EDA – exploratory data analysis. This reminded me of the missing semester of your CS education – skills that are needed and sometimes perceived as common knowledge in the industry but are not taught or talked about in academia.

“Exploratory Data Analysis (EDA) is the crucial process of using summary statistics and graphical representations to perform preliminary investigations on data in order to uncover patterns, detect anomalies, test hypotheses, and verify assumptions.” (see more here). EDA plays an important role in the everyday work of anyone working with data – data scientists, analysts, and data engineers. It is often also relevant for managers and developers, helping them solve the issues they face better and more efficiently and communicate their work and findings.

I started thinking about what an EDA course would look like –

Module 1 – Back to basics (3 weeks)

  1. Data types of variables, types of data
  2. Basic statistics and probability, correlation
  3. Anscombe’s quartet
  4. Hands on lab – Python basics (pandas, numpy, etc.)
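
To illustrate why Anscombe's quartet belongs in a "back to basics" module, here is a minimal sketch using only the Python standard library (the hands-on lab itself would more likely use pandas). The helper names `pearson` and `summarize` are mine, for illustration only. The four datasets share nearly identical summary statistics even though they look completely different when plotted.

```python
from statistics import mean, variance

# Anscombe's quartet: datasets I-III share the same x values.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson(xs, ys):
    """Pearson correlation coefficient, written out by hand."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


def summarize(xs, ys):
    """Summary statistics rounded to two decimals."""
    return {
        "mean_x": round(mean(xs), 2),
        "var_x": round(variance(xs), 2),
        "mean_y": round(mean(ys), 2),
        "corr": round(pearson(xs, ys), 2),
    }


SUMMARIES = {name: summarize(xs, ys) for name, (xs, ys) in QUARTET.items()}
for name, summary in SUMMARIES.items():
    print(name, summary)
```

Plotting the four datasets side by side is the natural next step, and a good bridge to the visualization module.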

Module 2 – Data visualization (3 weeks)

  1. Basic data visualizations and when to use them – pie chart, bar charts, etc.
  2. Theory of graphical representation (e.g. Grammar of graphics or something more up-to-date about human perception)
  3. Beautiful lies – graphical caveats (e.g. box plot)
  4. Hands-on lab – python data visualization packages (matplotlib, plotly, etc.).

Module 3 – Working with non-tabular data (4 weeks)

  1. Data exploration on textual data
  2. Time series – anomaly detection
  3. Data exploration on images

Module 4 – Missing data (2 weeks)

  1. Missing data patterns
  2. Imputations
  • Hands-on lab – a combination of missing data \ non-tabular data

Extras if time allows-

  1. Working with unbalanced data
  2. Algorithmic fairness and biases
  3. Data exploration on graph data

I’m very open to exploring and discussing this topic more. Feel free to reach out – Twitter, LinkedIn


5 interesting things (03/11/2022)

How to communicate effectively as a developer. – writing effectively is the second most important skill after reading effectively and one of the skills that can differentiate you and push you forward. If you read only one thing today, read this – 

https://www.karlsutt.com/articles/communicating-effectively-as-a-developer/

26 AWS Security Best Practices to Adopt in Production – this is a periodic reminder to pay attention to our SecOps. This post is very well written and the initial table of AWS security best practices by service is great. 

https://sysdig.com/blog/26-aws-security-best-practices/

EVA Video Analytics System – “EVA is a new database system tailored for video analytics — think MySQL for videos.”. Looks cool at first glance and I can think of use cases for myself, yet I wonder if it could become production-grade.

https://github.com/georgia-tech-db/eva

I see it as somehow complementary to – https://github.com/impira/docquery

Forestplot – “This package makes publication-ready forest plots easy to make out-of-the-box.”. I like it when academia and technology meet and this is really usable, also for data scientists’ day-to-day work. The next step would probably be deep integration with scikit-learn and pandas.

https://github.com/lsys/forestplot

Bonus – Python DataViz cookbook – an easy way to navigate between the different common Python visualization practices (i.e. via pandas vs. using matplotlib / plotly / seaborn). I would like to see it go to the next step – controlling the colors, grid, etc. from the UI and then switching between the frameworks, but that’s a starting point.

https://dataviz.dylancastillo.co/

roadmap.sh – it is not always clear how to level up your skills, what you should learn next (best practices, technology – which, etc). Roadmap.sh attempts to create such roadmaps. While I don’t agree with everything there, I think that the format and references are nice and it is a good inspiration.

https://roadmap.sh/

Shameless plug – Growing A Python Developer (2021), I plan to write a small update in the near future.

Think outside of the Box Plot

Earlier today, I spoke at DataTLV conference about box plots – what they expose, what they hide, and how they mislead. My slides can be found here, and the code used to generate the plots is here

Key takeaways

  • Box plots show the five-number summary – min, max, median, Q1, and Q3.
  • The flaws of box plots can be divided into two – data that is not present in the visualization (e.g. number of samples, distribution) and the visualization being counter-intuitive (e.g. quartiles are a hard concept to grasp).
  • I chose solutions that are easy to implement, either by leveraging existing packages’ code or by adding small tweaks. I used plotly.
  • Aside from those adjustments, many times a box plot is just not the right graph for the job.
  • If the statistical literacy of your audience is not well founded, I would try to avoid using a box plot.
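
The first two takeaways can be sketched in a few lines. This is a minimal illustration using the standard library rather than the plotly code from the talk; the function name `five_number_summary` and the three toy samples are made up for the example. The samples differ in size and shape, yet a plain box plot would draw them identically.

```python
from statistics import quantiles


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the 'inclusive' quantile method."""
    data = sorted(values)
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    return (data[0], q1, median, q3, data[-1])


# Three toy samples with different sizes and shapes
evenly_spread = [1, 2, 3, 4, 5, 6, 7, 8, 9]
clustered = [1, 1, 3, 3, 5, 7, 7, 9, 9]
tiny = [1, 3, 5, 7, 9]

for name, sample in [("evenly_spread", evenly_spread),
                     ("clustered", clustered),
                     ("tiny", tiny)]:
    print(name, len(sample), five_number_summary(sample))
```

All three samples produce the same five numbers, which is exactly the information the plot hides: the number of samples and the distribution inside the box.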

Topics I didn’t talk about and worth mentioning

  • Mary Eleanor Hunt Spear – a data visualization specialist who pioneered the development of the bar chart and box plot. I had a slide about her but went too fast and skipped it. See here.
  • How percentiles are calculated – several methods exist, and different Python packages use different default methods. Read more – http://jse.amstat.org/v14n3/langford.html
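
As a small illustration of the percentile point, the standard library's `statistics.quantiles` alone already offers two methods that disagree on the quartiles of the same data. This sketch only shows that one module; it does not cover the defaults of the other packages mentioned in the talk.

```python
from statistics import quantiles

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# "exclusive" (the default) treats the data as a sample from a larger population
exclusive = quantiles(data, n=4, method="exclusive")
# "inclusive" treats the data as containing the population's extreme values
inclusive = quantiles(data, n=4, method="inclusive")

print("exclusive:", exclusive)  # [2.75, 5.5, 8.25]
print("inclusive:", inclusive)  # [3.25, 5.5, 7.75]
```

The median agrees, but Q1 and Q3 move, so the box of a box plot drawn from these ten points depends on which method the plotting package uses.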

Resources I used to prepare the talk

5 interesting things (20/06/2022)

Visualizing multicollinearity in Python – I like the network one although it is not very intuitive at first sight. The others you can also get using pandas-profiling.

https://medium.com/@kenanekici/visualizing-multicollinearity-in-python-b5feedc9b3f1

Advanced Visualisations for Text Data Analysis – besides the suggested charts themselves, it is nice to get to know nxviz. I would actually like to see those charts as part of plotly as well.

https://towardsdatascience.com/advanced-visualisations-for-text-data-analysis-fc8add8796e2

Data Tales: Unlikely Plots – a bar chart is boring (but informative :), but sometimes we need to think out of the box plot

https://medium.com/mlearning-ai/data-tales-unlikely-plots-1882c2a903da

XKCDs I send a lot – Is XKCD already an industry standard?

https://medium.com/codex/xkcds-i-send-at-least-once-a-month-1f6e9f9b6b89

5 Tier Problem Hierarchy – I use this framework to think of tickets I write, what is the expected input, output, and complexity, what I expect from each of my team members, etc.

https://typeshare.co/kimsiasim/posts/5-tier-problem-hierarchy-4718

CSV to radar plot

I find a radar plot a helpful tool for visual comparison between items when there are multiple axes. It helps me sort out my thoughts. Therefore I created a small script that helps me turn CSV to a radar plot. See the gist below, and read more about the usage of radar plots here.

So how does it work? You provide a CSV file where the columns are the different properties and each record (i.e. line) is a different item you want to create a trace for.
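
To make the input format concrete, here is a minimal sketch of how each CSV record maps to one closed polygon. It uses only the standard library, and the function name `rows_to_traces` is hypothetical; it is not part of the gist. The two sample rows are taken from the example file.

```python
import csv
import io

CSV_TEXT = """Parent Company,2017,2018,2019,2020,2021
Facebook,3.0,5.0,7.0,7.0,4.0
Twitter,0.0,1.0,3.0,3.0,4.0
"""


def rows_to_traces(csv_text):
    """Map each CSV record to the theta/r lists of one radar trace."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    categories = header[1:]  # the first column holds the item label
    # Repeat the first axis at the end so the polygon closes
    theta = categories + categories[:1]
    traces = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        label, *values = row
        r = [float(v) for v in values]
        traces[label] = {"theta": theta, "r": r + r[:1]}
    return traces


traces = rows_to_traces(CSV_TEXT)
for label, trace in traces.items():
    print(label, trace["r"])
```

Each entry has one more point than there are columns, because the first value is repeated to close the shape.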

The following figure was obtained based on this csv –

https://gist.github.com/tomron/e5069b63411319cdf5955f530209524a#file-examples-csv

The data in the file is based on – https://www.kaggle.com/datasets/shivamb/company-acquisitions-7-top-companies

And I used the following command –

python csv_to_radar.py examples.csv --fill toself --show_legend --title "Merger and Acquisitions by Tech Companies" --output_file merger.jpeg
Radar plot
csv_to_radar.py –

import argparse
import sys

import pandas as pd
import plotly.graph_objects as go
import plotly.offline as pyo


def parse_arguments(args):
    parser = argparse.ArgumentParser(description='Parse CSV to radar plot')
    parser.add_argument('input_file', type=argparse.FileType('r'),
                        help='Data File')
    parser.add_argument(
        '--fill', default=None, choices=['toself', 'tonext', None])
    parser.add_argument('--title', default=None)
    parser.add_argument('--output_file', default=None)
    parser.add_argument('--show_legend', action='store_true')
    parser.add_argument('--show_radialaxis', action='store_true')
    return parser.parse_args(args)


def main(args):
    opt = parse_arguments(args)
    # The first CSV column holds the item labels and becomes the index
    df = pd.read_csv(opt.input_file, index_col=0)
    # Repeat the first category at the end so each polygon closes;
    # theta must have the same length as r below
    categories = [*df.columns, df.columns[0]]
    # One trace per CSV record
    data = [go.Scatterpolar(
        r=[*row.values, row.values[0]],
        theta=categories,
        fill=opt.fill,
        name=label) for label, row in df.iterrows()]
    fig = go.Figure(
        data=data,
        layout=go.Layout(
            title=go.layout.Title(text=opt.title, xanchor='center', x=0.5),
            polar={'radialaxis': {'visible': opt.show_radialaxis}},
            showlegend=opt.show_legend
        )
    )
    # Write to a file if requested, otherwise open the plot in a browser
    if opt.output_file:
        fig.write_image(opt.output_file)
    else:
        pyo.plot(fig)


if __name__ == "__main__":
    main(sys.argv[1:])

examples.csv –

Parent Company,2017,2018,2019,2020,2021
Facebook,3.0,5.0,7.0,7.0,4.0
Twitter,0.0,1.0,3.0,3.0,4.0
Amazon,12.0,4.0,9.0,2.0,5.0
Google,11.0,10.0,8.0,8.0,4.0
Microsoft,9.0,17.0,9.0,8.0,11.0

Package versions –

numpy==1.22.4
pandas==1.4.2
plotly==5.8.0
python-dateutil==2.8.2
pytz==2022.1
six==1.16.0
tenacity==8.0.1

Other pie chart

This morning I read “20 ideas for better data visualization“. I liked it very much, and I found the 8th idea in particular – “Limit the number of slices displayed in a pie chart” – very relevant for me. So I jumped into the plotly express code and created a figure of type other_pie which, given a number (n) and a label (other_label), creates a pie chart with n sectors. n-1 of those sectors are the top values according to the `values` column, and the remaining sector is the sum of the other rows.
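
The grouping logic described above can be sketched independently of plotly. This is a minimal standard-library illustration, not the code from the gist; the function name `collapse_to_top_n` and the sample numbers are made up for the example.

```python
def collapse_to_top_n(values, n, other_label="others"):
    """Keep the n-1 largest entries and sum the rest into one 'other' entry.

    `values` maps a sector name to its numeric value. If there are at most
    n entries, nothing is collapsed.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    ranked = sorted(values.items(), key=lambda item: item[1], reverse=True)
    if len(ranked) <= n:
        return dict(ranked)
    sectors = dict(ranked[: n - 1])
    sectors[other_label] = sum(value for _, value in ranked[n - 1:])
    return sectors


# Illustrative numbers only
sample = {"A": 50, "B": 30, "C": 10, "D": 6, "E": 4}
collapsed = collapse_to_top_n(sample, n=3)
print(collapsed)  # {'A': 50, 'B': 30, 'others': 20}
```

The sketch does not handle the case where `other_label` collides with an existing sector name; a real implementation would need to decide what to do there.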

A gist of the code can be found here (check here how to build plotly)

I used the following code to generate a standard pie chart and a pie chart with 5 sectors –

import plotly.express as px
df = px.data.gapminder().query("year == 2007").query("continent == 'Europe'")
df.loc[df['pop'] < 2.e6, 'country'] = 'Other countries' # Represent only large countries
pie_fig = px.pie(df, values='pop', names='country', title='Population of European continent')
otherpie_fig = px.other_pie(df, values='pop', names='country', title='Population of European continent', n=5, other_label="others")

And this is how it looks –

Pie chart
Other pie chart

5 interesting things (21/10/21)

4 Things Tutorials Don’t Tell You About PyPI – this hands-on experience together with the explanations is priceless. Even if you don’t plan to upload a package to PyPI anytime soon those glimpses of how PyPI works are interesting.

https://blog.paoloamoroso.com/2021/09/4-things-tutorials-dont-tell-you-about.html

Responsible Tech Playbook – I totally agree with Martin Fowler’s statement that “Whether asked to or not, we have a duty to ensure our systems don’t degrade our society”. This post promotes the book about Responsible Tech published by Fowler and his colleagues from Thoughtworks. It also references additional resources such as Tarot Cards of Tech and Ethical Explorer.

https://martinfowler.com/articles/2021-responsible-tech-playbook.html

A Perfect Match – A Python 3.10 Brain Teaser – Python 3.10 was released earlier this month and the most talked about feature is Pattern Matching. Read this post to make sure you get it correctly.

https://medium.com/pragmatic-programmers/a-perfect-match-ef552dd1c1b1

How I got my career back on track – a career is not a miracle. It’s totally OK if you don’t want to have one, but if you do and have aspirations, you have to own it and manage your way there.

https://rinaarts.com/how-i-got-my-career-back-on-track

PyCatFlow –  A big part of current data is time series data combined with categorical data. E.g., change in the mix of medical diagnosis \ shopping categories over time etc. PyCatFlow is a visualization tool which allows the representation of temporal developments, based on categorical data. Check their Jupyter Notebook with interactive widgets that can be run online.

https://medium.com/@bumatic/pycatflow-visualizing-categorical-data-over-time-b344102bcce2

Visualization – Data scientist toolkit

Data scientists are said to have better development knowledge than the average statistician and better statistical knowledge than the average developer. However, together with those skills one also needs marketing skills – the ability to communicate your not-so-simple job and results to other people. Those people can be the CTO or VP R&D, team members, customers, or sales and marketing people. They don’t necessarily share your knowledge or dive into the details as fast as you.

One of the best ways to make data and results accessible is creating visualizations, automatically of course. In this post I’ll review several visualization tools, mostly for Python, with some additional sidekicks.

Matplotlib – probably the best-known Python visualization package. It includes most of the standard charts – bar charts, pie charts, scatter plots, the ability to embed images, etc. Since many people use it, there are many questions, examples, and documentation around the web. However, the downside for me is that it is more complex than it should be. I have used it in several projects and I haven’t yet acquired the intuition to fully utilize it.

Matplotlib has several extensions including –

graphviz – Graph drawing software with a Python package. pygraphviz is a Python package for graphviz which provides a drawing layer and graph layout algorithms. The first downside is that you need to download the graphviz software. I have done it several times on several different machines (most of them running Ubuntu); it never went smoothly, and I was not able to do it only from the command line, which makes it problematic if one wants to deploy it on remote machines. I believe that it could be done, but at the moment I find this process an irksome overhead.

Sidekicks –

  • PyDot – Implements the DOT graph description language. PyDot is basically an interface to interact with PyGraphviz dot layout. The main advantage of DOT files is standardization – one can create a DOT file in one process and use it in another process. DOT is an intuitive language which focuses on drawing the graph and not on calculating the graph. I would say that it is the last step in the chain.
  • Networkx – a package for working with and manipulating graphs. It implements many graph algorithms such as shortest path, clustering, minimum spanning tree, etc. The graphs created in Networkx can be drawn using either matplotlib or pygraphviz, and can also be exported as DOT files.
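
To show how little DOT asks of you, here is a minimal sketch that writes a DOT description of an undirected graph by hand, using only the standard library. The function name `to_dot` and the edge list are made up for illustration; in practice PyDot or Networkx would generate this text for you.

```python
def to_dot(edges, name="G"):
    """Render an undirected edge list as DOT text."""
    lines = [f"graph {name} {{"]
    for source, target in edges:
        # Quote node names so that names with spaces stay valid DOT
        lines.append(f'    "{source}" -- "{target}";')
    lines.append("}")
    return "\n".join(lines)


edges = [("a", "b"), ("b", "c"), ("a", "c")]
dot_text = to_dot(edges)
print(dot_text)
```

The resulting text only describes which nodes are connected. Layout and rendering are left to whichever tool reads the file, which is why a DOT file works well as a hand-off between processes.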

Vincent – A relatively new Python visualization package. Vincent translates Python to Vega, which is a visualization grammar. I like it because it is easy, interactive, and simple to output either as JSON or as HTML. However, I’m not sure that Vincent and Vega are mature enough at this point to answer all the needs. It is important to mention that Vega is actually a wrapper above D3, which is an amazing tool with a growing community.

Additional related tools I’m not (yet) experienced with –

  • xlsxwriter – creating Excel files (xlsx format), including embedding charts in those files.
  • plot.ly – a much talked about collaboration and graphing tool which has a Python client. I try to keep my data as private as possible and don’t want to depend on an internet connection (for example, when creating a graph with a lot of data), so this is the downside of this tool for me. However, the social \ collaborative aspect of this product is also an important part, and the graphing is only one aspect of it.
  • Google charts – same downside as plot.ly – I like to be as independent as possible. However, compared to plot.ly it looks more mature and has far more options and chart types at this stage, and there is also a sandbox to play with it. Plot.ly has an advantage over Google charts in ease of use for non-programmers.
  • Bokeh – Nice, interactive charts on large data sets. Maybe the next big thing for plotting in Python.