How Akka Streams can be used to process the Wikidata dump in parallel (intenthq.com)
108 points by ArturSoler on June 23, 2015 | 19 comments


On a related note: When I indexed the whole English Wikipedia last year, I was surprised that it was possible to have a JSON version of it indexed[1] and searchable within half an hour on my laptop.

[1] Using esbulk, a parallel bulk indexer for ES: https://github.com/miku/esbulk


How about ~17 minutes (including Wikipedia data download and extraction time)! Using json-wikipedia [1] and lbzip2.

[1] https://github.com/diegoceccarelli/json-wikipedia


Thanks, JSON exports make Wikipedia data much more approachable.


I wrote a similar thing with an emphasis on multiple analyzed fields, which makes it slower to index but much more flexible to query.

https://github.com/andrewvc/wikiparse/tree/java

That being said, when it comes to indexing Wikipedia, Elasticsearch can spread the indexing work well across multiple threads internally. Multithreading the reading/parsing isn't a huge win; doing the decompression in a separate thread is, however.
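
To make that split concrete, here is a minimal Akka Streams sketch (not the article's or wikiparse's code; it assumes Apache Commons Compress for bzip2, and the dump path is a placeholder). StreamConverters.fromInputStream does its blocking reads on Akka's blocking IO dispatcher, so decompression naturally ends up on a different thread than the downstream stages:

    import java.io.FileInputStream
    import org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream
    import akka.actor.ActorSystem
    import akka.stream.ActorMaterializer
    import akka.stream.scaladsl.{Framing, Sink, StreamConverters}
    import akka.util.ByteString

    object DecompressOffThread extends App {
      implicit val system = ActorSystem("wiki")
      implicit val materializer = ActorMaterializer()
      import system.dispatcher

      val dumpPath = "enwiki-dump.json.bz2" // placeholder path

      // The blocking reads (and hence the bzip2 decompression) run on Akka's
      // blocking-io dispatcher, separate from the stages below.
      StreamConverters.fromInputStream(() =>
          new BZip2CompressorInputStream(new FileInputStream(dumpPath)))
        .via(Framing.delimiter(ByteString("\n"), maximumFrameLength = 1 << 20, allowTruncation = true))
        .map(_.utf8String)               // stand-in for real parsing
        .runWith(Sink.foreach(_ => ()))  // stand-in for handing docs to ES
        .onComplete(_ => system.terminate())
    }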


Yes, ES uses multiple threads nicely. But as you move to 32 or 64 cores, in my experience a single-threaded client won't keep ES/Lucene busy enough.

With Solr, it's similar:

> Sometimes you need to index a bunch of documents really, really fast. [...] The solution is two-fold: batching and multi-threading

From: http://lucidworks.com/blog/high-throughput-indexing-in-solr/
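
For what it's worth, the client-side shape of "batching and multi-threading" is only a few lines with Akka Streams. A rough sketch, where indexBatch is a stand-in for whatever bulk call your client exposes (e.g. a POST to the ES _bulk endpoint, or SolrJ's add), and the batch size and parallelism are made-up numbers:

    import scala.concurrent.Future
    import akka.actor.ActorSystem
    import akka.stream.ActorMaterializer
    import akka.stream.scaladsl.{Sink, Source}

    object BatchedIndexing extends App {
      implicit val system = ActorSystem("bulk")
      implicit val materializer = ActorMaterializer()
      import system.dispatcher

      // Stand-in for a real bulk request; replace with your client's call.
      def indexBatch(docs: Seq[String]): Future[Unit] = Future { () }

      Source(1 to 1000000)
        .map(i => s"""{"id":$i}""")                      // fake documents
        .grouped(1000)                                   // batching
        .mapAsyncUnordered(parallelism = 8)(indexBatch)  // up to 8 bulk requests in flight
        .runWith(Sink.ignore)
        .onComplete(_ => system.terminate())
    }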


> Process the whole Wikidata in 7 minutes with your laptop

Wikidata is several times smaller than Freebase (which Google shut down in May), and it still won't fit in your laptop's RAM.


As you point out, Freebase is bigger than Wikidata: it is 22GB compressed (250GB uncompressed), while Wikidata is 5GB compressed (49GB uncompressed) [1].

That said, I believe the process described in the blog post does not load the whole Wikidata dump into memory, and it would work just the same for processing Freebase or even larger data dumps on your laptop.

From the post: "How Akka Streams can be used to process the Wikidata dump in parallel and using constant memory with just your laptop."

[1] https://developers.google.com/freebase/data and http://dumps.wikimedia.org/other/wikidata/
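
For anyone wondering how "constant memory" works mechanically: the dump is one big JSON array, but with one entity per line, so you can frame the byte stream on newlines and parse a single entity at a time instead of holding 49GB. A rough sketch, not the post's code (the file path is a placeholder, Jackson is just one possible parser, and it assumes an already-decompressed dump with the one-entity-per-line layout):

    import java.nio.file.Paths
    import akka.actor.ActorSystem
    import akka.stream.ActorMaterializer
    import akka.stream.scaladsl.{FileIO, Framing}
    import akka.util.ByteString
    import com.fasterxml.jackson.databind.ObjectMapper

    object CountEntities extends App {
      implicit val system = ActorSystem("wikidata")
      implicit val materializer = ActorMaterializer()
      import system.dispatcher

      val mapper = new ObjectMapper()

      FileIO.fromPath(Paths.get("wikidata-dump.json"))   // placeholder path
        .via(Framing.delimiter(ByteString("\n"), maximumFrameLength = 10 * 1024 * 1024, allowTruncation = true))
        .map(_.utf8String.stripSuffix(","))              // each entity sits on its own comma-terminated line
        .filterNot(line => line == "[" || line == "]")   // drop the enclosing array brackets
        .map(line => mapper.readTree(line))              // parse one entity at a time; memory stays flat
        .runFold(0L)((n, _) => n + 1)
        .foreach { n => println(s"$n entities"); system.terminate() }
    }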


What are your favorite large, publicly available datasets?


Biased reply (I'm a data scientist there): Common Crawl[1]. We build and maintain an open repository of web crawl data that can be accessed and analyzed by anyone completely free.

[1]: http://commoncrawl.org/



The Cancer Genome Atlas, Ensembl, 1000Genomes.


From their video: The presenter: "Why would you (the assistant) be interested in cars?" The assistant: "I'm the perfect chick to be into Maserati."

It's a bit disturbing to see an employee presenting her personal life, kids, interests, and what not. Good job, IntentHQ!

The video: https://www.intenthq.com/resources/interest-fingerprint/


I wouldn't say it was disturbing, but it was definitely cringeworthy. A lot of their blog feels that way. They've gone for an informal corporate approach, using puns[1] and memes[2] as headings. Even the video felt badly scripted: it was meant to sound like an informal pub conversation, but instead it came off as awkward and unprofessional.

I'm sure their products are of the highest quality, but their blog isn't a great advert in my opinion.

[1] http://engineering.intenthq.com/2015/06/for-those-about-to-c...

[2] http://engineering.intenthq.com/2015/06/wikidata-akka-stream...


I have to say I have mixed feelings about the video. On one hand, I understand there's a whole world of people out there, and I don't mind openness and honesty. Big thumbs up to her for being honest and cool. On the other hand, it cuts the other way when you're bragging about your awesome product that analyzes people's lives and sells that info to corporations.


You're assuming those details aren't made up ;)

The privacy thing didn't really bother me, because it's either fake data or she's consented to publishing real data about herself; either way it's a considered decision. My issue was just how awkwardly the presentation was delivered. Maybe that could have been avoided if they had used a fictional character like Homer Simpson? But then that would have its own issues.


Why is it disturbing to see someone who is open and not afraid of a bunch of kids and freaks on the Internet? :)


Could this example have been accomplished with awk and xargs just as fast, with same or less memory usage, in fewer lines of code?

Seems so to me after skimming the article, but maybe I missed an important advantage of using Akka Streams for this task?


Yes, the initial parts of the example could be accomplished with awk and xargs, but as the article goes on to demonstrate, even doing something like printing every nth element would be difficult.

I think the intent was for this to be more of a demonstrative example, and with a more complex, evolving, real-world processing pipeline, Akka Streams could be really useful.
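
Roughly, once the dump is already a Source of parsed entities, an "every nth element" stage plus a parallel processing stage is just a couple of composable lines. A sketch with made-up placeholders (entities and enrich stand in for the real parsing and processing steps):

    import scala.concurrent.Future
    import akka.NotUsed
    import akka.stream.scaladsl.{Flow, Sink, Source}

    object PipelineSketch {
      // Placeholders for the real parsed-entity source and per-entity work.
      val entities: Source[String, NotUsed] = Source(1 to 100000).map(i => s"entity-$i")
      def enrich(e: String): Future[String] = Future.successful(e.toUpperCase)

      // Roughly "every 1000th element" (the final partial group also emits its last element).
      val everyNth: Flow[String, String, NotUsed] = Flow[String].grouped(1000).map(_.last)

      val pipeline = entities
        .via(everyNth)
        .mapAsync(parallelism = 4)(enrich)  // up to 4 enrich calls in flight, order preserved
        .to(Sink.foreach(println))
      // pipeline.run() once an ActorSystem and materializer are in scope
    }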


Are streaming json parsers that rare?
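
They aren't rare. Jackson's streaming API, for instance, can walk the dump's top-level array one token at a time and materialize only one entity at once. A rough Scala sketch (placeholder path; assumes the dump is a single JSON array of entity objects):

    import java.io.File
    import com.fasterxml.jackson.core.JsonToken
    import com.fasterxml.jackson.databind.{JsonNode, ObjectMapper}

    object StreamingParse extends App {
      val mapper = new ObjectMapper()
      val parser = mapper.getFactory.createParser(new File("wikidata-dump.json")) // placeholder path

      require(parser.nextToken() == JsonToken.START_ARRAY) // the dump is one big array

      var count = 0L
      while (parser.nextToken() == JsonToken.START_OBJECT) {
        val entity: JsonNode = mapper.readTree(parser) // reads exactly one entity, then stops
        // hand `entity` to whatever processing you need here
        count += 1
      }
      parser.close()
      println(s"$count entities")
    }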



