
Well, to start with, how would you determine that about your distribution in the first place? And if that works well enough, why use a box plot afterwards?


Well, usually when you are analyzing some data, you first toss it into the most basic chart, like a histogram.

And a histogram for the author's example is perfectly acceptable to show that single data series.

But imagine you have 10 different normal data series and you want to compare their medians and distributions with each other... well, are you going to put 10 histograms side by side and expect the reader to compare them? No -- that's where the box-and-whisker plot shines.


Yes, exactly! Just plot all the bloody data and be done with it. No one is doing this by hand anymore so it is no extra work.

To my mind, if you have a genuine EDA attitude you plot it all.


> Just plot all the bloody data and be done with it

Well no, because you can compare the datasets by eye and say questionable qualitative things about them, but you can't make definitively true quantitative statements about them.

Show me two plots of data points and I can show you two people who will in good faith argue over which one has the higher mean or higher median or higher variance. Because you often can't tell.

The entire point of something like a box plot is that it does part of the quantitative analysis for you. You can see where the median is. You can see the width of the quartiles.
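
To make that concrete, here's a rough sketch of the five numbers a box plot draws, using only Python's standard library. The two batches are made up for illustration, and I'm assuming the simplest convention where the whiskers run to the min and max (many box plot conventions put the whiskers at 1.5 x IQR instead and draw the rest as outliers).

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    data = sorted(values)
    # method="inclusive" treats the data as the full population, so the
    # quartiles never fall outside the observed min/max.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return data[0], q1, median, q3, data[-1]

# Two hypothetical batches with the same range but different spread.
batch_a = [1, 2, 3, 4, 5, 6, 7, 8, 9]
batch_b = [1, 4, 4, 5, 5, 5, 6, 6, 9]

for name, batch in (("a", batch_a), ("b", batch_b)):
    lo, q1, med, q3, hi = five_number_summary(batch)
    print(f"{name}: min={lo} Q1={q1} median={med} Q3={q3} max={hi} IQR={q3 - q1}")
```

Both batches have the same median and the same extremes, but the interquartile range differs, which is exactly the kind of thing that is hard to argue about from raw dots alone.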


But there are much better ways to do this than box plots! Lots of CS papers use CDFs, and they're great and very informative (although you do need to get used to them). You can have violin plots with all the box plot elements and more. Even if you want to restrict yourself to quartiles, the author's design concepts with narrow/wide bars make much more visual sense and still convey exactly the same information as box plots.


It depends on the purpose.

CDF plots are great for plotting a single distribution, but contain way too much information if you want to plot 6 distributions next to each other for easy comparison.
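
For anyone who hasn't met them, an empirical CDF is just "what fraction of the data is at or below x". A minimal sketch in plain Python, with a made-up sample; the `ecdf` helper is my own name, not from any library:

```python
from bisect import bisect_right

def ecdf(values):
    """Return a function F where F(x) is the fraction of values <= x."""
    data = sorted(values)
    n = len(data)

    def cdf(x):
        # bisect_right counts how many sorted values are <= x
        return bisect_right(data, x) / n

    return cdf

# Hypothetical sample, purely for illustration.
F = ecdf([3, 1, 2, 2, 5])
print(F(0), F(2), F(5))
```

Note that the curve has one step per data point, so nothing is summarized away. That is the strength for one distribution and the problem when six of them overlap on the same axes.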

Violin plots are interesting but also quite complicated, since you have to arbitrarily choose a kernel shape and this artificial smoothing can make it look like you have much more data than you really do.
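
Here's a small sketch of what I mean about the kernel choice, assuming a Gaussian kernel and a hand-picked bandwidth (both are my assumptions, and the four data points are invented). The same data gives a two-humped or a one-humped "violin" depending on nothing but the bandwidth:

```python
import math

def gaussian_kde(data, bandwidth):
    """Return a density estimate built from one Gaussian bump per data point."""
    norm = 1.0 / (len(data) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data
        )

    return density

def count_peaks(density, grid):
    """Count strict local maxima of the density over an evenly spaced grid."""
    ys = [density(x) for x in grid]
    return sum(1 for i in range(1, len(ys) - 1) if ys[i - 1] < ys[i] > ys[i + 1])

# Only four (hypothetical) observations, in two small clusters.
data = [1.0, 2.0, 9.0, 10.0]
grid = [i * 0.5 for i in range(-20, 41)]  # -10.0 to 20.0 in steps of 0.5

narrow = count_peaks(gaussian_kde(data, 1.0), grid)   # small bandwidth
wide = count_peaks(gaussian_kde(data, 10.0), grid)    # large bandwidth
print(narrow, wide)
```

Either way you get a perfectly smooth curve out of four points, which is the part that can mislead a reader about how much data is behind the picture.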

I really don't like the author's "alternative designs" because I think they're even more open to misinterpretation than box plots. It's hard to judge though, because the central problem is that the author is trying to represent a bimodal distribution, and shouldn't be using box plots or the 2 "alternative designs" for that.


Simple, use a histogram.

The author's first histogram clearly shows most of the distribution lies in [20,100), then the [10,20) bin is empty but the [0,10) bin is quite full. Hence, that's not a single-mode distribution. It has two modes, one around [50,60) and the other in [0,10).
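
If you want the same check without eyeballing a chart, a rough sketch: bin the data at width 10 and look for empty bins between occupied ones. The numbers below are invented to mimic the shape described above (they are not the author's data), and the helper names are mine.

```python
import math

def histogram(values, bin_width=10):
    """Map each bin's left edge to its count, including empty bins in range."""
    starts = [math.floor(v / bin_width) * bin_width for v in values]
    counts = {
        start: 0
        for start in range(min(starts), max(starts) + bin_width, bin_width)
    }
    for start in starts:
        counts[start] += 1
    return counts

def empty_bins(counts):
    """Left edges of bins with no data; these all lie between occupied bins."""
    return [start for start, count in counts.items() if count == 0]

# Hypothetical data: a cluster near zero, a gap, then a broad hump.
values = [2, 3, 5, 7, 8, 22, 35, 41, 48, 52, 55, 58, 63, 67, 71, 84, 96]
counts = histogram(values)

for start, count in counts.items():
    print(f"[{start:>2},{start + 10:>3}) {'#' * count}")
print("empty bins:", empty_bins(counts))
```

A gap like that is only a hint, of course. With a small sample or a narrower bin width you will get empty bins by chance, so the bin width matters here in the same way the kernel bandwidth matters for a violin plot.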


Because it's very hard to rationally compare multimodal batches without single test statistics. And they present five summary figures for each batch, each of which is a reasonable metric to compare batches with.



