One of the most elegant paper series I've ever read is her "More Types for Nested Data Parallelism" (https://dl.acm.org/doi/abs/10.1145/351240.351249) + the preceding NESL / segmented scan papers by Guy Blelloch (https://www.cs.cmu.edu/~guyb/papers/Nesl2.0.pdf). They helped set us on the GPU journey that became Graphistry, and are why we bet on GPU dataframes (ex: RAPIDS.ai) eventually overtaking Spark and friends even before Databricks was a company :) If you do pandas/tensorflow/numpy/sql/etc., this stuff is amazing.
Seconded! I had the pleasure of taking Guy Blelloch's parallel programming class at CMU as a sophomore. A large majority of the homework assignments involved leveraging the segmented scan operator to solve all sorts of classical parallel programming problems.
Really eye-opening stuff! I am still impressed that prefix sum (generalized over all associative operators, not just +) can be done in O(log n) span. And the algorithm is not too complicated either; I was able to get the gist of it even as an undergrad.
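For intuition, here's a minimal list-based sketch of the classic contraction scheme (the names `pscan`, `deinterleave`, and `interleave` are mine for illustration, not from the course or papers): combine adjacent pairs, recurse on the half-size list, then expand. Each level is a constant number of maps/zips over the data and there are O(log n) levels, so the span is O(log n) when those maps/zips run in parallel:

```haskell
-- Inclusive scan for any associative f. List-based for readability;
-- a real parallel version would use arrays and a parallel zipWith.
pscan :: (a -> a -> a) -> [a] -> [a]
pscan _ [] = []
pscan f xs = interleave evenResults oddResults
  where
    (evens, odds) = deinterleave xs                 -- x0,x2,.. and x1,x3,..
    oddResults    = pscan f (zipWith f evens odds)  -- recurse on n/2 pair-sums
    evenResults   = head evens : zipWith f oddResults (tail evens)

-- Split a list into its even-indexed and odd-indexed elements.
deinterleave :: [a] -> ([a], [a])
deinterleave (a:b:rest) = let (es, os) = deinterleave rest in (a:es, b:os)
deinterleave xs         = (xs, [])

-- Merge two lists back together, alternating elements.
interleave :: [a] -> [a] -> [a]
interleave (a:as) (b:bs) = a : b : interleave as bs
interleave as     bs     = as ++ bs
```

Sanity check: `pscan (+) [1..8]` gives the same result as `scanl1 (+) [1..8]`, but the recursion tree is only log-depth.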
I read that the prefix scan is great, so I read somebody's thesis. I (a) didn't really get it and (b) felt it was doing some complicated stuff for little value. Doubtless I'm wrong, since computer scientists tend to be smarter than me, so can you give me some insight into why it's so valuable?
There are two really cool things going on here. Also, keep in mind NESL was published in 1992 (and thus made in ~1990), which was not far from when the Connection Machine and Crays got hyped yet largely failed to bring parallel compute mainstream.
1. Data parallelism, esp. on super power-efficient parallel hardware (SIMD, GPU, all the stuff that lets you go 10-100X over Spark for the same budget), is hard for computing over "irregular" data like trees and graphs. Prefix scan showed a basic way to do it for a bunch of basic math over these (see the segmented-scan sketch after this list), and a lot of the 90s / early 2000s became research into covering all sorts of weird cases. That showed data parallelism is broadly applicable, so worth pushing forward on.
2. Guy B, and later Manuel C & Gabriele K, showed you don't have to be a weird HPC person hand-rolling Fortran, nor rely on weird one-off special-case libraries, to use the above. We can build libraries of high-level primitives -- think map/reduce -- even to the level that compilers can make it look like "normal" Haskell. Most things can compile down to prefix scan (see the filter sketch after this list). This is similar to Impala/Spark figuring out RDDs/dataframes and putting SQL on top, except, again, with 10-100X better performance, and you don't have to contort everything into 1970s SQL. Re: prefix scan in particular, we don't _have_ to implement all the algorithms as prefix scan -- it's a reliable base case, and where possible (most cases!), we nowadays swap in more efficient hand-written CUDA library calls instead.
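Since (1) is the part people usually ask about, here's a minimal sketch of the segmented-scan trick (the name `segScan` and the flag-list encoding are mine for illustration): irregular/nested data gets flattened into one flat value array plus boolean segment-start flags, and a scan with a lifted operator restarts at each boundary. The lifted operator stays associative whenever `f` is, which is exactly why it parallelizes like any ordinary scan:

```haskell
-- Segmented inclusive scan: each True flag starts a new segment.
-- Shown with sequential scanl1 for clarity; because g is associative
-- whenever f is, a parallel scan works just as well.
segScan :: (a -> a -> a) -> [(Bool, a)] -> [a]
segScan f = map snd . scanl1 g
  where
    g (fl, v) (fl', v')
      | fl'       = (fl || fl', v')        -- segment boundary: restart
      | otherwise = (fl || fl', f v v')
```

E.g. `segScan (+) [(True,1),(False,2),(True,3),(False,4)]` yields `[1,3,3,7]`: the running sum resets where the second segment starts.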
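And to make (2)'s "most things compile down to prefix scan" concrete, here's a hypothetical sketch (the name `parFilter` is mine) of how even `filter` lowers to flat data-parallel steps: a map for the predicate, an exclusive +-scan to assign each kept element an output slot, and a scatter to write them:

```haskell
import Data.Array (array, elems)

-- filter as map + exclusive scan + scatter, all flat parallel steps.
parFilter :: (a -> Bool) -> [a] -> [a]
parFilter p xs = elems (array (0, total - 1) writes)
  where
    keeps  = map p xs                                       -- parallel map
    ones   = [if k then 1 else 0 | k <- keeps] :: [Int]
    idxs   = init (scanl (+) 0 ones)                        -- exclusive +-scan
    total  = sum ones                                       -- a reduce
    writes = [(i, x) | (x, k, i) <- zip3 xs keeps idxs, k]  -- scatter
```

`parFilter even [1..6]` gives `[2,4,6]`; on a GPU each of those steps is one flat kernel, which is the whole flattening story in miniature.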
Nowadays, with GPUs in the cloud, Moore's Law stalled out for CPUs but still going strong for GPUs, and most big data systems realizing their latency and data shuttling/serialization suck, all ^^^ is a big deal. See the 10yr AWS price chart I calculated: https://twitter.com/lmeyerov/status/1247382156487213057/phot... . You can just write your Python dataframe code in https://rapids.ai/ and not really think about it, just knowing you're doing better than the Hadoop stuff underneath and without managing a dirty horde. The existence of NESL and the research area that came after means we're still at the beginning of swapping out crappy CPU software for better data-parallel stuff, and _it's doable_.
I recently tried getting back into Haskell; the goal was to write Haskell and drop into Rust for speed. Even with Stack, the toolchains and the library compatibility were too daunting to figure out. Cargo just works: with rustup and JetBrains Toolbox, I can have a fully configured dev environment that can build everything I need in under 5 minutes.
Now if Cargo could install Haskell and its libs, I would jump on it in an instant.
Great slides near the end; this project looks really fun, and it's cool to see parallelism applied to simple vanilla code -- rather than having to build the transformation through some eDSL. That said, I hate to be a downer, but I've really moved away from Haskell, so I don't see myself playing with this. Lately it's PyTorch or PySpark, with an eye on Rust, and a mild interest in Julia.
I saw her give this talk as a keynote at Lambda Days, and it was easily my favorite talk there this year (except mine, of course :) ).
I find it kind of weird that, outside of academic circles and the like, Haskell is largely dismissed as "not practical"; I think the "cabal hell" days before Stack came out really hurt the language and its adoption. Libraries like Accelerate really demonstrate how powerful something like Haskell can be when applied correctly.
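Agreed. For anyone who hasn't tried it, the canonical Accelerate example is a dot product: you write ordinary-looking combinator code, and the library fuses and compiles it to GPU (or multicore CPU) kernels. A sketch from memory, so treat the module names and versions as assumptions (the `.Native` backend is the multicore CPU one; swap in `.PTX` for NVIDIA GPUs):

```haskell
import Data.Array.Accelerate             as A
import Data.Array.Accelerate.LLVM.Native as CPU  -- assumed backend; .PTX for GPU

-- Looks like vanilla Haskell, but zipWith/fold here build an AST that
-- Accelerate fuses into a single parallel kernel.
dotp :: Acc (Vector Float) -> Acc (Vector Float) -> Acc (Scalar Float)
dotp xs ys = A.fold (+) 0 (A.zipWith (*) xs ys)

main :: IO ()
main = do
  let xs = fromList (Z :. 5) [1 .. 5]       :: Vector Float
      ys = fromList (Z :. 5) [2, 2, 2, 2, 2] :: Vector Float
  print (CPU.run (dotp (use xs) (use ys)))  -- expect 30.0
```

The point isn't the dot product, it's that the same two-combinator definition runs unchanged on CPU or GPU.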
I second J and Futhark. I love J, and I wish there were a way to use the GPU from J. I have looked briefly at the APL -> TAIL -> Futhark pipeline, but I don't know enough to do something useful with that particular toolchain, or to wow myself enough to keep going. More studying...
The lesser-known Single Assignment C (SAC) uses array-based functional programming as well [1], [2]. As the name implies, it uses a C/Algol-based syntax for parallel programming of multi-core CPUs and GPUs.
Similar to Haskell, it originated in academic research. It utilizes the static single assignment (SSA) concept, which can be used for both imperative and functional programming [3]. Neat stuff. Hopefully someone can emulate this with a modern open-source hybrid functional language like Scala or D.
Fun to see it come full circle here!