Think outside of the Box Plot

Earlier today, I spoke at DataTLV conference about box plots – what they expose, what they hide, and how they mislead. My slides can be found here, and the code used to generate the plots is here

Key takeaways

  • Boxplots show 5 number statistics – min, max, median, q1 and,q3.
  • The flaws of Box Plots can be divided into two – data that is not present in the visualization (e.g. number of samples, distribution) and the visualization being counter-intuitive (e.g. quartiles is hard to grasp the concept).
  • I choose solutions that are easy to implement, either by leveraging existing packages code or by adding small tweaks. I used plotly.
  • Aside of those adjustment I many times box plot is just not the right graph for the job.
  • If the statistical literacy of your audience is not well founded I would try avoiding using box plot.

Topics I didn’t talk about and worth mentioning

  • Mary Eleanor Hunt Spear –  data visualization specialize who pioneered the development of the bar chart and box plot. I had a slide about her but went too fast, and skipped it. See here.
  • How percentiles are calculated – Several methods exist, and different Python packages use different default methods. Read more –http://jse.amstat.org/v14n3/langford.html

Resources I used to prepare the talk