Here is a hexbin of a highly correlated nonlinear relationship (y=x^2 + gaussian...

cscheid · on May 26, 2012

Just FYI, the colormap you picked is terrible. There are many experiments to back this claim:

http://gvi.seas.harvard.edu/paper/evaluation-artery-visualiz...

http://www.jwave.vt.edu/~rkriz/Projects/create_color_table/c...

Colormap design is arguably as hard as visualization design. My favorite go-to place for them is http://colorbrewer2.org, but if you need to know only one thing about them, it's that varying hue continuously does not work nearly as well as you might think it does.

In addition, the fundamental reason scatterplots are bad, even with opacity, is essentially that opacity gives rise to an exponential relationship between overplotting and transparency.

There exists an alternative solution, which is to use additive blending and an adaptive linear colorscale, from zero to maximum-overdraw. Unfortunately, at present there are no data visualization toolkits that support this.

carlob · on May 26, 2012

>In addition, the fundamental reason scatterplots are bad, even with opacity, is essentially that opacity gives rise to an exponential relationship between overplotting and transparency.

Can you elaborate on this? I don't see why it would not be linear.

>There exists an alternative solution, which is to use additive blending and an adaptive linear colorscale, from zero to maximum-overdraw. Unfortunately, at present there are no data visualization toolkits that support this.

I think this _might_ be done in Mathematica, since Graphics objects can be manipulated symbolically, but I might be wrong.

cscheid · on May 27, 2012

It boils down to the (very reasonable) way alpha blending works. Alpha was originally designed to always lie between zero and one, which for compositing makes sense. For scatterplot colormapping, not so much:

If you create a plot with opacity alpha that puts N points on top of each other, the remaining transparency is (1 - alpha)^N, so the resulting opacity is

1 - (1 - alpha)^N

This is an exponential, which has the unfortunate feature that it's flat for most of the regime, and then spikes in a relatively short scale. The spike is where we get color differentiation (different opacities get different colors). That's bad: color differentiation should be uniform across the scale.
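To make the formula concrete, here is a minimal sketch that tabulates the stacked opacity for a few values of N. The function name `stacked_opacity` and the choice of alpha = 0.05 are illustrative assumptions, not something taken from the plots in this thread.

```python
def stacked_opacity(alpha, n):
    """Opacity of n overlapping points, each drawn with opacity alpha,
    under standard 'over' alpha blending: 1 - (1 - alpha)^n."""
    return 1.0 - (1.0 - alpha) ** n


alpha = 0.05  # assumed per-point opacity, for illustration only
for n in (1, 2, 5, 10, 20, 50, 100):
    # Compare the blended opacity with the naive linear guess n * alpha.
    print(f"N={n:3d}  blended={stacked_opacity(alpha, n):.3f}  linear={min(n * alpha, 1.0):.3f}")
```

The two columns agree only for very small N; after that the blended opacity falls behind the linear guess, so equal changes in overplotting do not produce equal changes in color.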

I'm pretty certain Mathematica doesn't do this right either, because it's a pixel-based technique that requires frame buffer manipulation. Instead of rendering with the usual blending operation, you do everything with additive blending, compute the maximum overdraw, and then color-scale linearly.
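A rough sketch of that idea in plain Python, without a real frame buffer: count how many points land on each pixel (additive blending), find the maximum overdraw, and scale linearly. The names `overdraw_counts` and `linear_scale`, the 64x64 grid, and the sample data are all assumptions for illustration; each point is treated as a single pixel here, whereas real markers cover several.

```python
import random


def overdraw_counts(points, width, height, xlim, ylim):
    """Additive blending: each point adds 1 to the pixel it falls on."""
    (x0, x1), (y0, y1) = xlim, ylim
    grid = [[0] * width for _ in range(height)]
    for x, y in points:
        if not (x0 <= x <= x1 and y0 <= y <= y1):
            continue  # skip points outside the viewport
        col = min(int((x - x0) / (x1 - x0) * width), width - 1)
        row = min(int((y - y0) / (y1 - y0) * height), height - 1)
        grid[row][col] += 1
    return grid


def linear_scale(grid):
    """Map counts linearly to [0, 1], from zero to maximum overdraw."""
    peak = max(max(row) for row in grid)
    if peak == 0:
        return [[0.0] * len(row) for row in grid]
    return [[count / peak for count in row] for row in grid]


# Assumed sample data: y = x^2 plus Gaussian noise.
random.seed(0)
xs = [random.uniform(-1.0, 1.0) for _ in range(10000)]
points = [(x, x * x + random.gauss(0.0, 0.1)) for x in xs]

counts = overdraw_counts(points, 64, 64, (-1.0, 1.0), (-0.5, 1.5))
intensity = linear_scale(counts)  # feed this into any sequential colormap
```

Because the scale is linear in the count, a pixel with twice the overdraw gets twice the intensity all the way up to the maximum.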

yummyfajitas · on May 26, 2012

They suggest the rainbow color map is a poor choice because it hides small details. I'm advocating regularization which ensures that you have no small details.

But after reading that paper, I do agree that rainbow has some significant issues. One thing that might be worth trying: make a rainbow color map, but map values to colors in such a way that |x-y|=C x cielab_dist(color(x), color(y)).
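One way to approximate that mapping is to reparameterize the rainbow by its cumulative arc length in CIELAB. The sketch below is a sketch under assumptions: it uses a naive HSV hue sweep as the "rainbow", the plain Euclidean CIELAB distance, and hypothetical helper names (`srgb_to_lab`, `rainbow`, `build_uniform_table`, `uniform_rainbow`). It equalizes distance only between nearby values, since the rainbow path bends in CIELAB space.

```python
import bisect
import colorsys
import math


def srgb_to_lab(rgb):
    """Convert an sRGB triple in [0, 1] to CIELAB (D65 white point)."""
    def to_linear(c):
        return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

    r, g, b = (to_linear(c) for c in rgb)
    x = 0.4124564 * r + 0.3575761 * g + 0.1804375 * b
    y = 0.2126729 * r + 0.7151522 * g + 0.0721750 * b
    z = 0.0193339 * r + 0.1191920 * g + 0.9503041 * b

    def f(t):
        d = 6 / 29
        return t ** (1 / 3) if t > d ** 3 else t / (3 * d * d) + 4 / 29

    fx, fy, fz = f(x / 0.95047), f(y / 1.0), f(z / 1.08883)
    return (116 * fy - 16, 500 * (fx - fy), 200 * (fy - fz))


def rainbow(t):
    """Naive rainbow: sweep HSV hue from blue (t=0) to red (t=1)."""
    return colorsys.hsv_to_rgb((1 - t) * 2 / 3, 1.0, 1.0)


def build_uniform_table(cmap, samples=1024):
    """Cumulative CIELAB arc length along cmap, normalized to [0, 1]."""
    ts = [i / (samples - 1) for i in range(samples)]
    labs = [srgb_to_lab(cmap(t)) for t in ts]
    cum = [0.0]
    for a, b in zip(labs, labs[1:]):
        cum.append(cum[-1] + math.dist(a, b))
    total = cum[-1]
    return ts, [c / total for c in cum]


def uniform_rainbow(value, table):
    """Rainbow color at the given fraction of total CIELAB arc length."""
    ts, cum = table
    i = bisect.bisect_left(cum, value)
    if i == 0:
        return rainbow(ts[0])
    if i >= len(cum):
        return rainbow(ts[-1])
    span = cum[i] - cum[i - 1]
    frac = (value - cum[i - 1]) / span if span > 0 else 0.0
    return rainbow(ts[i - 1] + frac * (ts[i] - ts[i - 1]))


table = build_uniform_table(rainbow)
color = uniform_rainbow(0.5, table)  # an (r, g, b) triple in [0, 1]
```

This does not fix the other objections to rainbow maps (for example, lightness is still not monotonic); it only evens out the step size along the scale.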

gammarator · on May 27, 2012

I believe the 'Spectral' colormap (with a capital S) in matplotlib does exactly that.

Based on the names [1, 2], it seems to be a ColorBrewer [3] colormap, like all the other capitalized matplotlib colormaps. These are all designed with these perceptual considerations in mind [4].

[1] https://github.com/gka/chroma.js/wiki/Predefined-Colors

[2] http://matplotlib.sourceforge.net/examples/pylab_examples/sh...

[3] http://colorbrewer2.com/

[4] http://vis4.net/blog/posts/avoid-equidistant-hsv-colors/

twstws · on May 26, 2012

The superiority of your hexbin follows from setting the opacity too high for the scatter plot. With points this dense, an opacity of 5% or lower is necessary to see the uneven distribution along the x axis. With appropriate opacity, the two plots are pretty similar visually. What's more, the hexbin by definition has lower resolution, since you lump data into discrete bins.

This is an example of bad plotting practice, not a bad plotting method. That said, eyeballing a plot is a weak way to analyze data this dense. That's what statistics are for.

yummyfajitas · on May 26, 2012

Ok, redid it with 5% opacity. You are correct, at that level, the density distribution in x is qualitatively visible.

http://i.imgur.com/geKbT.png

I disagree that the hexbin has lower resolution. The color dimension allows the human eye to easily differentiate regions having similar densities (e.g., 70 vs 50). The difference between deep red and orange is a lot bigger than the difference between dark blue and slightly less dark blue.

The hexbin has lower spatial resolution, it's true, but I'd argue that the spatial resolution you get in a scatterplot is illusory. It doesn't reflect the underlying probability distribution, only the particular sample.

twstws · on May 26, 2012

> The difference between deep red and orange is a lot bigger than the difference between dark blue and slightly less dark blue.

That may be true, but is it better? Does the scatterplot underemphasize different densities, or does the density plot overemphasize them? I think the scatterplot is more intuitive. There are 21 equal steps between 0 and 100% black. Two points are twice as dark as one, four points are twice as dark as two. Darker means more, lighter means less.

Compare that to shifting from blue to red. Does the shift from orange to red indicate the same density difference as the shift from blue to orange? To decide you need to consult the color scale. The scatterplot is intuitive, and requires no scale.

> The hexbin has lower spatial resolution, it's true, but I'd argue that the spatial resolution you get in a scatterplot is illusory. It doesn't reflect the underlying probability distribution, only the particular sample.

The spatial resolution of a scatterplot represents empirical reality. Each point corresponds to a single observation, with no probability distribution implied or imposed. The density plot, in contrast, imposes a probability distribution, which may or may not reflect the true distribution of the population. The larger the bins, the more likely the displayed pattern is 'illusory'.

yummyfajitas · on May 26, 2012

> The spatial resolution of a scatterplot represents empirical reality. Each point corresponds to a single observation, with no probability distribution implied or imposed.

tl;dr: I'm a Bayesian, you are a Frequentist (at least with regard to plotting).

twstws · on May 26, 2012

I don't know what this has to do with Bayesian vs. frequentist. I am not arguing that the data do not have a probability distribution. I am arguing that it is better to show all the data when possible, rather than an eye-catching but lossy summary.

sesqu · on May 26, 2012

Just a note, but you're using awfully small hexes and large points there. The two plots approach each other asymptotically, especially if you switch to a monochromatic colormap like some here have argued for.

Histograms tend to work best when the data is well understood, while scatterplots are better for samples from an unknown distribution (incl. lattices, multimodals or even double exponentials).

Try also small samples. When the piecewise uniform prior (on both the data and the intensity) is approximately accurate, histograms are far better, as they guide the eye away from nonexistent patterns. But the bandwidth needs to be judiciously set, and often the data transformed.

Clustering is hard to automate.

yummyfajitas · on May 26, 2012

I have 20-70 data points per hex, at least in the high density regions. While I could have used bigger hexes, I think the difference between a hexbin and a scatterplot is well illustrated.

sesqu · on May 27, 2012

Fair enough. But your point markers are almost half the size of the hex. That makes color the principal difference.

gammarator · on May 26, 2012

(It also helps to set edgecolor='none' to remove the black line around each point.)