Optimising Garbage Collection Overhead in Sigma (simonmar.github.io)
33 points by psibi on July 28, 2015 | 4 comments


I mentored Giovanni on the two-step allocator patch, and in the process we discovered that on Windows, page tables for all of your reserved address space are counted towards the memory limits of your process. This also affects Go (https://golang.org/issue/5402 and https://golang.org/issue/5236) which uses a similar trick of reserving virtual address space for its heap. If anyone has ideas for how to deal with this on Windows I think many people would be quite interested :)


> Sigma, which is part of the anti-spam infrastructure

Sounds like something that could be done as a batch job with little memory. Or not?


Sigma is quite advanced; it's essentially an online DSL where authors can push anti-spam rules (written in Haskell) into live production by reloading code at runtime. Other services query Sigma in real time, and in turn Sigma continuously queries a lot of other data sources to determine whether a rule should fire.

For example, a particular rule about the nature of some of your friends on Facebook may need to query 10 different data sources (different DBs, caches, monitoring infrastructure). One of the really nice things about Sigma is that it's built on Haxl, a library for efficient concurrent data access, which can also optimize the typical 'N+1 query' problem away.

What this means is you can write a program like:

  ids <- getAllUserIds             -- fetch from the source, 1 time
  forM ids $ \id -> do
    friends <- getUserFriends id   -- N queries, 1 for each id
    ...
This is simple and naive, yet Haxl can automatically optimize it into a program that will A) batch the data accesses together (so instead of issuing N separate queries, one per ID, the fetches are combined into a single batched request), B) access each data source concurrently with no programmer intervention, so queries that can execute in parallel do so, and C) cache the results, so that you aren't re-querying already-fetched data.
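
For the curious, here's a minimal sketch of what that loop looks like as actual Haskell on top of Haxl (assuming a recent Haxl where the monad is GenHaxl u w; getAllUserIds and getUserFriends are hypothetical stand-ins for wrappers around dataFetch on some made-up data sources, not real Facebook APIs):

  module FriendsSketch where

  import Haxl.Core (GenHaxl)

  type Haxl = GenHaxl () ()   -- no user environment, no writes

  -- Hypothetical data-source wrappers; a real program would define a
  -- DataSource instance and call 'dataFetch' on its request type.
  getAllUserIds :: Haxl [Int]
  getAllUserIds = error "stub: fetch all user ids from the id source"

  getUserFriends :: Int -> Haxl [Int]
  getUserFriends _ = error "stub: fetch the friends of one user"

  -- The naive loop from above. Because 'traverse' combines the per-id
  -- fetches applicatively, Haxl collects them into rounds: one fetch to
  -- get the ids, then the N friend lookups batched per data source, run
  -- concurrently, and cached for the rest of the request.
  friendsOfEveryone :: Haxl [[Int]]
  friendsOfEveryone = do
    ids <- getAllUserIds            -- round 1: a single request
    traverse getUserFriends ids     -- round 2: N fetches, one batch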

There's a very good paper by Simon, the author of this blog post, discussing the design of Haxl and its use: http://community.haskell.org/~simonmar/papers/haxl-icfp14.pd... Quite the neat system!

Note: I do not work at Facebook, but I do chat with Simon a bit - this is basically the very high-level, 20,000-foot view based on what I've read of Simon's writing on the subject.


There are some long-running “batch-jobs” for large computations where it’s acceptable to apply a response retroactively, but most uses of Sigma are synchronous—a client needs to know immediately (more or less) how some bit of content is classified and what action to take in response to it.



