Last week I gave an extended version of my talk about box plots in Noa Cohen’s Introduction to Data Science class at Azrieli College of Engineering Jerusalem. Slides can be found here.
The students are in their third and fourth years, and some will become data scientists and analysts. Their questions and comments, together with my experience with junior data analysts, made me realize that a big gap they face in landing those positions and performing well in them is exploratory data analysis (EDA). This reminded me of the missing semester of your CS education: skills that are needed, and sometimes perceived as common knowledge in the industry, but are not taught or talked about in academia.
“Exploratory Data Analysis (EDA) is the crucial process of using summary statistics and graphical representations to perform preliminary investigations on data in order to uncover patterns, detect anomalies, test hypotheses, and verify assumptions.” (see more here). EDA plays an important role in the everyday work of anyone who deals with data: data scientists, analysts, and data engineers. It is often also relevant for managers and developers, helping them solve the issues they face better and more efficiently and communicate their work and findings.
I started thinking about what an EDA course would look like –
Module 1 – Back to basics (3 weeks)
- Data types of variables, types of data
- Basic statistics and probability, correlation
- Anscombe’s quartet
- Hands-on lab – Python basics (pandas, numpy, etc.)
Module 2 – Data visualization (3 weeks)
- Basic data visualizations and when to use them – pie chart, bar charts, etc.
- Theory of graphical representation (e.g. the Grammar of Graphics, or something more up-to-date about human perception)
- Beautiful lies – graphical caveats (e.g. box plot)
- Hands-on lab – Python data visualization packages (matplotlib, plotly, etc.)
Module 3 – Working with non-tabular data (4 weeks)
- Data exploration on textual data
- Time series – anomaly detection
- Data exploration on images
Module 4 – Missing data (2 weeks)
- Missing data patterns
- Hands-on lab – a combination of missing data and non-tabular data
Extras if time allows –
- Working with unbalanced data
- Algorithmic fairness and biases
- Data exploration on graph data
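To give a flavor of the Module 1 material, here is a minimal sketch of the Anscombe’s quartet exercise. It uses only the Python standard library, and the names `QUARTET`, `pearson`, and `summarize` are my own illustrative choices, not part of any package. The values are the quartet as commonly reproduced from Anscombe’s 1973 paper. The point is that four datasets with nearly identical means and correlations can look completely different once plotted, which is why summary statistics alone are not enough.

```python
from math import sqrt
from statistics import mean

# Anscombe's quartet: datasets I-III share the same x values.
_X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (_X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (_X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (_X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def summarize(x, y):
    """Return the summary statistics a first-pass EDA might report."""
    return {"mean_x": mean(x), "mean_y": mean(y), "r": pearson(x, y)}


for name, (x, y) in QUARTET.items():
    s = summarize(x, y)
    print(f"{name:>3}: mean_x={s['mean_x']:.2f} mean_y={s['mean_y']:.2f} r={s['r']:.2f}")
```

All four rows come out essentially the same, which is exactly the trap. A natural follow-up in the lab would be to draw the four scatter plots (for example with matplotlib) and see how different the datasets really are.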
I’m very open to exploring and discussing this topic more. Feel free to reach out on Twitter or LinkedIn.