Hacker News
Plotnine: Grammar of Graphics for Python (2019) (datascienceworkshops.com)
139 points by EntICOnc on Jan 28, 2021 | 37 comments


In case anyone is wondering what the big deal is with ggplot / plotnine, I record myself doing hour-long data analyses in Python with it!

I've noticed that a lot of bootcamp grads can use matplotlib to do very simple plots, but when it comes to iterating on a data analysis (trying different plots, faceting on variables, etc.), they get tripped up quickly.

I'm trying to use a port of dplyr I'm working on (siuba) and plotnine to show what on-the-fly analyses might look like. Can't speak highly enough of plotnine!

https://youtu.be/z6xNKZZMWgU


As a related tool, there is Altair (https://altair-viz.github.io/), which also implements the Grammar of Graphics as well as a Grammar of Interaction.


What does "implements the Grammar of Graphics" mean?

Plotly really does look like ggplot, but Altair doesn't look nearly as friendly:

  alt.Chart(source).mark_bar().encode(
      alt.X("IMDB_Rating:Q", bin=True),
      y='count()',
  )
I guess I can see how it does the same thing, but the ergonomics don't really appear to be in the same league.


Take a look at the underlying Vega-Lite library (https://vega.github.io/vega-lite/) and the paper that introduces the grammar of interactions (https://idl.cs.washington.edu/papers/vega-lite/) for some insights into the design.

What specifically do you find unergonomic about the design of Altair?


Altair is a cool library and I hope it continues to improve. I like using it, but the main problem I've had with Altair is the size of the plots. Unless you save the plot as an image, the plot is rendered as a Vega-Lite spec JSON object. The JSON spec stores the entire dataset, so it limits you to using 5,000 data points. Otherwise you'll have this insanely large JSON object representing the plot. There are work-arounds like using a data server however.


> the plot is rendered as a Vega-Lite spec JSON object. The JSON spec stores the entire dataset, so it limits you to using 5,000 data points

Use "named datasets", specify data via URL, etc. Do not ever specify data inline with the spec. It might get loaded into a JSON object under the hood regardless, IDK, but it's remarkably faster and simple plots of ~100000 points will work on the average computer/vm
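To make the inline-versus-URL difference concrete, here is a minimal sketch that builds two Vega-Lite specs as plain Python dicts and compares their serialized sizes. It does not use Altair itself; the field name `value`, the 100,000 synthetic rows, and the path `data/points.json` are all made up for illustration.

```python
import json

# Synthetic rows standing in for a real dataset (assumption: 100,000 points).
rows = [{"value": i % 97} for i in range(100_000)]

# Shared chart definition: a histogram of "value".
chart = {
    "mark": "bar",
    "encoding": {
        "x": {"field": "value", "type": "quantitative", "bin": True},
        "y": {"aggregate": "count", "type": "quantitative"},
    },
}

# Inline: every row is embedded in the spec itself.
inline_spec = {**chart, "data": {"values": rows}}

# By URL: the spec only points at the data (hypothetical path).
url_spec = {**chart, "data": {"url": "data/points.json"}}

inline_size = len(json.dumps(inline_spec))
url_size = len(json.dumps(url_spec))
print(f"inline: {inline_size:,} bytes, by URL: {url_size:,} bytes")
```

The URL variant stays the same size no matter how many rows the file holds, which is why notebooks and saved HTML stay small with that approach.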


You can disable that if you need to and “insanely” large json objects are still just hundreds of megs or maybe a few gigs, still reasonable.

But I’ve also been able to just preprocess some of the aggregation and calculation before the visualization and that’s worked out ok.
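A rough sketch of that kind of pre-aggregation, using only the standard library: bin the raw values first and hand the plotting library just the per-bin counts. The helper name `bin_counts` and the synthetic data are assumptions for illustration, not part of Altair.

```python
from collections import Counter


def bin_counts(values, bin_width):
    """Count how many values fall into each bin of width `bin_width`.

    Returns a list of {"bin_start", "count"} rows sorted by bin start.
    """
    counts = Counter((v // bin_width) * bin_width for v in values)
    return [{"bin_start": start, "count": n} for start, n in sorted(counts.items())]


# One million raw values collapse into ten rows for plotting.
raw = [i % 100 for i in range(1_000_000)]
summary = bin_counts(raw, bin_width=10)
print(len(raw), "->", len(summary), "rows")
```

The resulting rows can then be plotted as an ordinary bar chart, well under any row limit.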

I’ve used it over plotly a few times because it doesn’t have plotly’s offline/online issue, and I think the Altair method of saving interactive JavaScript versions is more straightforward than plotly’s.


Grammar of Graphics is a book[1] that talks about how to talk about graphics.

ggplot is an interpretation of grammar of graphics. Altair is another, separate interpretation.

As far as I can understand, plotnine is an attempt at replicating ggplot.

[1] https://www.springer.com/gp/book/9780387245447


You beat me here! I was surprised that I didn't find "Altair" in the source article. Thanks for posting.


I've had to learn R for some Epi classes this semester and, coming from doing figures/stat analysis in Python, it really is a breath of fresh air. IMO, the R ecosystem for stats and figure-making just has much better "defaults" than matplotlib. That being said, I have been incredibly frustrated with R error messages and some syntactical things about the language, so I'm super happy to hear that plotnine exists.


This tutorial is excellent, and skimming it was worth it to me just for the discovery of https://github.com/Phlya/adjustText!

Also, an unfortunate namespace wart is that there's a python package called ggpy that seems to have been abandoned since 2016: https://github.com/yhat/ggpy

"plotnine" being the de facto python equivalent of ggplot2 is not obvious at all, but I'll take it :)


Am I the only one annoyed by plotnine's arbitrary use of abbreviations?

Some are slightly more excusable, like aes instead of aesthetic, but given auto-completion in today's editors, I find this kind of choice annoying because it makes the code less readable for newcomers.

But others seem more gratuitous to me. For example, the color gradients: 'Blues' for a blue gradient, but GnBu for a gradient going from green to blue? How hard is it to type Greens-Blues? It also makes it harder to remember which are abbreviated and which are not.


Those gradient names are annoying, but they are a de facto standard from ColorBrewer[1]. I always have to look at that website to figure out the abbreviation I need in ggplot2.

[1] https://colorbrewer2.org/#type=sequential&scheme=GnBu&n=3
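For what it's worth, the multi-hue sequential names follow a simple pattern of two-letter hue codes, so they can be decoded mechanically. A small sketch (the helper `expand_scheme` is hypothetical and only handles the sequential codes listed; the diverging schemes use other codes it does not cover):

```python
import re

# Two-letter hue codes used in ColorBrewer's multi-hue sequential scheme names.
HUES = {
    "Bu": "Blue", "Gn": "Green", "Or": "Orange", "Pu": "Purple",
    "Rd": "Red", "Yl": "Yellow", "Br": "Brown",
}


def expand_scheme(name):
    """Spell out an abbreviated scheme name; return other names unchanged."""
    codes = re.findall(r"[A-Z][a-z]", name)
    if "".join(codes) != name or any(c not in HUES for c in codes):
        return name  # e.g. "Blues" is already spelled out
    return "-".join(HUES[c] for c in codes)


for scheme in ["Blues", "GnBu", "YlOrRd", "PuBuGn"]:
    print(scheme, "->", expand_scheme(scheme))
```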


Plotnine uses other packages in the scientific Python ecosystem. That is probably where the abbreviations that irk you come from. In some cases those "abbreviations" have roots 20 years deep!


First time I've heard about this. It is definitely pretty, but I cannot wrap my head around the syntax... It feels wrong. But I am a long-time matplotlib user, so that should be no surprise.


I found a previous HN submission very helpful: https://evamaerey.github.io/ggplot_flipbook/ggplot_flipbook_...

The author builds up plots step by step, showing the changes to the plot along the way. It's really great at showing what each element contributes to the final plot.


Yeah, the Grammar of Graphics syntax can feel a bit awkward at first, but once it clicks it makes a lot of sense.


Strongly suggest using Let's Plot from JetBrains instead. It's much closer to ggplot2 in API and results.

https://github.com/JetBrains/lets-plot


Is it really closer? plotnine has very few API differences from ggplot2.


Can someone tell me, a longtime gnuplot user (http://www.gnuplot.info/), when I'd want to use ggplot/plotnine?

Is there a sort of Turing completeness among plotting programs, i.e. can everything ultimately be done on any popular plotting program, or are there things achievable only in some and not others?


I think Gnuplot is strongest when you want a pretty plot of a complicated function (3D, parametric, etc). The grammar of graphics approach is optimized for understanding data sets.

For strengths of plotnine, ggplot, and Altair over Gnuplot (or matplotlib), see section 3.3 of the article, particularly the example

ggplot(data=mpg) + geom_point(mapping=aes(x="displ", y="hwy", color="class"))

You can easily replace 'color' with 'shape', or even faceting (getting a separate plot for each class).

I do not know Gnuplot very well -- what's the easiest Gnuplot way to do this? [1] I do know that when I was looking at data sets, switching from plain matplotlib (where you'd have to plot in a loop over each class) to the grammar of graphics style was a breath of fresh air for me.

Separately, and less interestingly, if you are already using Python or Jupyter, Gnuplot isn't as smoothly integrated into that ecosystem.

[1]: The first thing I found on the website is http://gnuplot.sourceforge.net/demo_5.4/varcolor.html, but perhaps there's a better or more minimal example?


A lot of people swear by ggplot and other grammar of graphics tools, so I have an open question. Let's take an example from the page:

  ggplot(data=mpg) +\
  geom_point(mapping=aes(x="displ", y="hwy", alpha="manufacturer"))
How much easier is that than

  g = ggplot(data=mpg)
  g.geom_point(mapping=aes(x="displ", y="hwy", alpha="manufacturer"))
Or, for instance:

  ggplot(mpg, aes("displ", "hwy")) +\
  geom_point(aes(color="class")) +\
  geom_smooth(se=False) +\
  labs(title="Fuel efficiency generally decreases with engine size")
VS:

  g = ggplot(mpg, aes("displ", "hwy"))
  g.geom_point(aes(color="class"))
  g.geom_smooth(se=False)
  g.title = "Fuel efficiency generally decreases with engine size"

What am I missing?


One critical thing methods miss is that they can't decentralize contribution. For example, the gganimate package in R gives users new ggplot functions. With a `+`, users can use functions from any package, so the gganimate approach works.

With method chaining the gganimate author would have to mutate some class, and users would have to load all methods (vs importing what you need).
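A toy sketch of that point in Python, under the assumption that `+` simply delegates to whatever object is on its right-hand side. None of these classes come from ggplot2, plotnine, or gganimate; `Plot`, `GeomPoint`, `TransitionTime`, and the `add_to` hook are invented for illustration.

```python
class Plot:
    """Toy plot object: just an immutable tuple of layer descriptions."""

    def __init__(self, layers=()):
        self.layers = tuple(layers)

    def __add__(self, component):
        # Delegate to the component; Plot needs no knowledge of its type.
        return component.add_to(self)


class GeomPoint:
    """A layer shipped with the hypothetical core library."""

    def add_to(self, plot):
        return Plot(plot.layers + ("point",))


# --- imagined third-party package, written without touching Plot ---
class TransitionTime:
    """An animation component defined outside the core library."""

    def __init__(self, column):
        self.column = column

    def add_to(self, plot):
        return Plot(plot.layers + (f"animate by {self.column}",))


p = Plot() + GeomPoint() + TransitionTime("year")
print(p.layers)
```

With a method-based API, `TransitionTime` would instead have to be attached to `Plot` as a new method, which means patching a class the extension author does not own.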


In general, the code philosophy behind ggplot2 and related tools (the so-called "tidyverse" in R) embraces functional programming, in particular doing computation by pure composition of smaller computations.

Using the "+" operator to denote composing parts of visualizations is not the greatest syntax but I think we're basically stuck with it for a bit due to historical baggage. See this note from the creator of ggplot, Hadley Wickham: https://community.rstudio.com/t/why-cant-ggplot2-use/4372/7


Maybe the new R native pipe operator will fix the issues? |>


It may make more sense when you see analysts writing & sharing a lot of code sessions, especially via notebooks. Functional plotting ends up helping a lot! For big graph-y graphs, we made pygraphistry that way, which enables multi-cell flows like:

```
import cudf
import graphistry

# Load the click log on the GPU and drop repeated (user_ip, click) pairs
df = cudf.read_csv('1GB.csv').drop_duplicates(['user_ip', 'click'])

g1 = graphistry.edges(df, 'user_ip', 'click')
g1.plot()

g2 = g1.encode_point_color('risk', ['blue', 'yellow', 'red'])
g2.plot()

g2.edges(cudf.read_csv('file2.csv')).plot()  # reuse g2's color settings
g1.edges(cudf.read_csv('file2.csv')).plot()  # ... or just g1's graph shape
```

Being able to 'fork' plots and interactively swap in different data / encodings is super great over the course of a session. You can always go back to an earlier one as you make progress. Likewise, you can rerun notebook cells and read them top-to-bottom without worrying too much.

So while we're looking at some V2 additions, maybe supporting R, and updating some of the core (more automatic GPU goodness!)... we're definitely keeping the compositional style.

Interesting nit: Libraries copying the original grammar of graphics can likely benefit from friendlier functional DSL presentation styles. As is, I think they make it much harder to read + write, undercutting much of the productivity potential. I love the academic concept of making everything a composable value, but doing naked composition over a massive namespace of diverse types is super confusing to read + write.

Learning from pandas & jquery, we ended up instead steering users to chaining for the typical case: `g.bind(...).edges(...).nodes(...).encode(...).plot()`. It's functional so you can always do `g_intermediate = g...` and likewise still do first-class GoG-syntax-style things with them if you really want `f(g._bindings)`. However, those are the minor case, and people doing them make code harder to read + write:

-- Reading GoG code is confusing: In `x + f(y)`, often unclear what x, y, and f(y) are, and more so in dynamic languages like Python + R that they're used in. In `g.bind(..).encode(...).plot(...)`, each composition is pretty obvious in the typical case, and you can always read back or do first-class in the atypical case.

-- GoG plot authoring is jarring: When doing `x + ...`, tab complete doesn't get you far. If tab complete does somehow kick in, you are dealing with a big namespace dump. Instead, I see people turn to Google for almost every step! In contrast, tab complete on `g.nodes(df)...` will pull up the most likely next settings to add, and then again for the arguments to fill into whatever command you pick.

GoG defaults to those for the typical case, vs atypical one, so a 2nd-class imperative API may be easier. But with chaining, we get functional composition without losing straight-line reading and tab-complete. Best of both worlds!
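A minimal sketch of how "fork-able" chaining can work, assuming every method returns a fresh copy instead of mutating in place. This is not pygraphistry's implementation; the `Plotter` class and its `with_edges` method are stand-ins invented for illustration.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class Plotter:
    """Toy immutable plot description; every method returns a new copy."""

    edges: Optional[str] = None
    point_color: Optional[Tuple[str, ...]] = None

    def with_edges(self, source):
        return replace(self, edges=source)

    def encode_point_color(self, palette):
        return replace(self, point_color=tuple(palette))

    def plot(self):
        return f"edges={self.edges}, colors={self.point_color}"


g1 = Plotter().with_edges("file1.csv")
g2 = g1.encode_point_color(["blue", "yellow", "red"])  # fork: g1 is untouched
g3 = g2.with_edges("file2.csv")                        # reuses g2's colors

print(g1.plot())
print(g3.plot())
```

Because nothing is mutated, rerunning an earlier notebook cell gives the same plot it gave the first time, and tab completion on any intermediate value lists only the methods of `Plotter`.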


If you aren't accustomed to using classes or python in general, the style of ggplot seems more natural I think. The cognitive load is also slightly less looking at ggplot code in my opinion.

But if you use python already, it wouldn't matter.


The article mentions the official ggplot book, which previously existed only in dead tree form. Now, however, a new edition is available online: http://ggplot2-book.org/


My colleague who is a pro at R gave us some internal training on using this in Python. Never going back to Matplotlib again!


what's with the matplotlib hate on this post? it's an extremely mature piece of technology, but maturity itself isn't the only reason why libraries like ggplot and altair aren't the default choice for python plots. as per many topics, most plots are just quick drawings and plt.scatter(x, y) is not just quicker, but, by far, much more intuitive to me than anything involving the word "geom" (ggplot2) or "encode" (altair).

some here say "i haven't gone back to mpl". oh i have. every single time i tried any of these alternatives.


Wow. Very useful.

Keeping a bookmark for when I need it and I will use this sometime in future I am sure.


As a former data analyst, I enjoyed ggplot a lot. It really is next-level data visualization stuff.

I tried to find something like ggplot in Python but failed to find one. Good to know that there is a python version of it.


I use this daily and love it. I used to use R just for EDA and visualization and Python for the rest, but plotnine changed that for me so I can do everything in one language. So much better than Matplotlib IMO.


It looks impressive. I am just wondering if it supports NetCDF (or if it is on the roadmap). In particular, things like pcolormesh and displaying grids on a map.


Tangentially, are there any good native plotting libraries in Julia? I haven't found one.


Julia's GoG package is actually even more popular than plotnine: http://gadflyjl.org/


The best ggplot2-like option in Python, by far. I use it constantly.



