Earlier today, I spoke at the DataTLV conference about box plots – what they expose, what they hide, and how they mislead. My slides can be found here, and the code used to generate the plots is here.
- Box plots show the five-number summary – min, max, median, Q1, and Q3.
- The flaws of box plots fall into two groups – data that is not present in the visualization (e.g. number of samples, distribution) and the visualization being counter-intuitive (e.g. quartiles are a hard concept to grasp).
- I chose solutions that are easy to implement, either by leveraging existing packages' code or by adding small tweaks. I used plotly.
- Aside from those adjustments, a box plot is often just not the right graph for the job.
- If your audience's statistical literacy is not well established, I would try to avoid using a box plot.
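To make the first kind of flaw concrete, here is a minimal sketch using only Python's standard library (it is not the plotly code from the talk). The two samples and the helper `five_number_summary` are made up for illustration. They share the same five numbers, so their box plots would look identical even though the sample sizes and shapes differ.

```python
from statistics import quantiles


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the 'inclusive' quartile method."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    return (min(values), q1, median, q3, max(values))


# Hypothetical data: 5 evenly spread points vs. 9 points clustered around 2.5 and 7.5.
small_sample = [0, 2.5, 5, 7.5, 10]
clustered_sample = [0, 2.4, 2.5, 2.6, 5, 7.4, 7.5, 7.6, 10]

for name, sample in [("small", small_sample), ("clustered", clustered_sample)]:
    # Printing n next to the summary exposes what the box alone hides.
    print(name, "n =", len(sample), five_number_summary(sample))
```

Both lines print the same summary; only `n` tells the samples apart, which is why adding the sample count (or the raw points) to the plot is a cheap improvement.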
Topics I didn’t talk about that are worth mentioning
- Mary Eleanor Hunt Spear – a data visualization specialist who pioneered the development of the bar chart and box plot. I had a slide about her but went too fast and skipped it. See here.
- How percentiles are calculated – several methods exist, and different Python packages use different default methods. Read more – http://jse.amstat.org/v14n3/langford.html
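As a small illustration of the percentile point, the standard library's `statistics.quantiles` offers two methods, and they disagree on the quartiles of the same data. The sample below is made up for the example. As far as I know, `statistics.quantiles` defaults to `"exclusive"`, while NumPy's default linear interpolation matches the `"inclusive"` variant, so the same box can end up with different edges depending on the package.

```python
from statistics import quantiles

# Hypothetical sample: the integers 1..10.
data = list(range(1, 11))

# Same data, two quartile definitions.
exclusive = quantiles(data, n=4, method="exclusive")
inclusive = quantiles(data, n=4, method="inclusive")

print("exclusive:", exclusive)  # [2.75, 5.5, 8.25]
print("inclusive:", inclusive)  # [3.25, 5.5, 7.75]
```

The median agrees, but Q1 and Q3 shift by half a unit, which is enough to move the box edges and, with them, which points get flagged as outliers.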
Resources I used to prepare the talk