December 2022 – Tom Ron

Last week I gave an extended version of my talk about box plots in Noa Cohen's Introduction to Data Science class at Azrieli College of Engineering Jerusalem. Slides can be found here.

The students are in their 3rd and 4th years, and some will become data scientists and analysts. Their questions and comments, together with my experience with junior data analysts, made me realize that a big gap they face in pursuing those positions and performing well in them is doing EDA – exploratory data analysis. This reminded me of The Missing Semester of Your CS Education – skills that are needed and sometimes perceived as common knowledge in the industry but are not taught or talked about in academia.

“Exploratory Data Analysis (EDA) is the crucial process of using summary statistics and graphical representations to perform preliminary investigations on data in order to uncover patterns, detect anomalies, test hypotheses, and verify assumptions.” (see more here). EDA plays an important role in the everyday work of anyone working with data – data scientists, analysts, and data engineers. It is often also relevant for managers and developers, helping them solve the issues they face better and more efficiently and communicate their work and findings.

I started turning over in my head what an EDA course would look like –

Module 1 – Back to basics (3 weeks)

Data types of variables, types of data
Basic statistics and probability, correlation
Anscombe’s quartet
Hands-on lab – Python basics (pandas, numpy, etc.)
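To make the Anscombe's quartet item concrete, here is a minimal sketch of what a first exercise could look like. It uses only the Python standard library so it runs anywhere (in an actual lab, pandas would be the natural choice), and the helper name `pearson` is my own, not part of any package. The four datasets share nearly identical means and correlation, which is exactly why summary statistics alone are not enough.

```python
from statistics import mean

# Anscombe's quartet: four small datasets with near-identical summary statistics.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson(xs, ys):
    """Pearson correlation coefficient, written out by hand (illustrative helper)."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Collect mean of x, mean of y, and correlation for each dataset.
summary = {
    name: (mean(xs), mean(ys), pearson(xs, ys))
    for name, (xs, ys) in quartet.items()
}

for name, (mx, my, r) in summary.items():
    print(f"{name:>3}: mean(x)={mx:.2f}  mean(y)={my:.2f}  r={r:.3f}")
```

The printed rows look almost the same, yet a scatter plot of each dataset tells four very different stories, which is a good bridge into the visualization module.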

Module 2 – Data visualization (3 weeks)

Basic data visualizations and when to use them – pie chart, bar charts, etc.
Theory of graphical representation (e.g. Grammar of Graphics or something more up-to-date about human perception)
Beautiful lies – graphical caveats (e.g. box plot)
Hands-on lab – Python data visualization packages (matplotlib, plotly, etc.).
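One way to illustrate the box plot caveat without drawing anything: a box plot is built from the five-number summary, so two datasets with the same summary produce the same box even when their shapes differ. The sketch below uses made-up toy data and a helper I named `five_number_summary`; it relies on `statistics.quantiles` with the inclusive method, which is one of several quartile conventions.

```python
from collections import Counter
from statistics import quantiles


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the inclusive quartile method."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    return (min(values), q1, median, q3, max(values))


# Toy data (invented for illustration): evenly spread vs. clustered around 2 and 6.
evenly_spread = [0, 1, 2, 3, 4, 5, 6, 7, 8]
two_clusters = [0, 2, 2, 2, 4, 6, 6, 6, 8]

summary_a = five_number_summary(evenly_spread)
summary_b = five_number_summary(two_clusters)

print(summary_a)
print(summary_b)

# The value counts reveal the difference that the box plot hides.
print(sorted(Counter(evenly_spread).items()))
print(sorted(Counter(two_clusters).items()))
```

Both summaries come out identical, so the two box plots would be indistinguishable, while a histogram or strip plot would immediately show the clusters.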

Module 3 – Working with non-tabular data (4 weeks)

Data exploration on textual data
Time series – anomaly detection
Data exploration on images
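For the time series item, a possible starting exercise is a rolling z-score: flag a point when it sits far from the mean of the preceding window. This is only one simple technique among many, and the function name, window size, threshold, and readings below are all assumptions chosen for illustration.

```python
from statistics import mean, stdev


def rolling_zscore_anomalies(series, window=5, threshold=3.0):
    """Return indices whose value is more than `threshold` standard deviations
    away from the mean of the `window` points that precede it."""
    anomalies = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # A perfectly flat window gives no scale to compare against; skip it.
            continue
        if abs(series[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Invented sensor-like readings with one obvious spike at index 7.
readings = [10, 11, 10, 12, 11, 10, 11, 50, 11, 10, 12, 11]
flagged = rolling_zscore_anomalies(readings)
print(flagged)
```

A useful discussion point for students: once the spike enters the trailing window it inflates the standard deviation, so nearby anomalies can be masked. Rolling medians or more robust scale estimates are common ways to soften that.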

Module 4 – Missing data (2 weeks)

Missing data patterns
Imputations

Hands-on lab – a combination of missing data / non-tabular data
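A small sketch of how the missing data module could open: first count which combinations of columns are missing together, then try the simplest imputation and discuss its limits. The records, column names, and helper functions here are invented for the example, and mean imputation is shown only as a baseline, not as a recommendation.

```python
from collections import Counter
from statistics import mean

# Invented records; None marks a missing value.
rows = [
    {"age": 34, "income": 5200, "city": "Haifa"},
    {"age": None, "income": 4100, "city": "Jerusalem"},
    {"age": 29, "income": None, "city": None},
    {"age": None, "income": 6100, "city": "Tel Aviv"},
    {"age": 41, "income": 4800, "city": "Jerusalem"},
]
columns = ["age", "income", "city"]


def missing_pattern(row):
    """Return the tuple of column names that are missing in this row."""
    return tuple(col for col in columns if row[col] is None)


# How often does each combination of missing columns occur?
patterns = Counter(missing_pattern(row) for row in rows)


def impute_mean(records, column):
    """Return copies of the records with missing `column` values replaced
    by the mean of the observed values (baseline imputation)."""
    observed = [r[column] for r in records if r[column] is not None]
    fill = mean(observed)
    return [
        {**r, column: fill if r[column] is None else r[column]}
        for r in records
    ]


imputed = impute_mean(rows, "age")

print(patterns)
print([round(r["age"], 1) for r in imputed])
```

Looking at the pattern counts before imputing matters: values that go missing together (here income and city) hint that the missingness may not be random, which changes which imputation methods are defensible.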

Extras if time allows –

Working with unbalanced data
Algorithmic fairness and biases
Data exploration on graph data

I’m very open to exploring and discussing this topic more. Feel free to reach out – Twitter, LinkedIn.


Exploratory Data Analysis Course – Draft