How Akka Streams can be used to process the Wikidata dump in parallel (intenthq.com)
108 points by ArturSoler on June 23, 2015 | 19 comments


On a related note: When I indexed the whole English Wikipedia last year, I was surprised that it was possible to have a JSON version of it indexed[1] and searchable within half an hour on my laptop.

[1] Using esbulk, a parallel bulk indexer for ES: https://github.com/miku/esbulk


How about ~17 minutes (including Wikipedia data download and extraction time)! Using json-wikipedia [1] and lbzip2.

[1] https://github.com/diegoceccarelli/json-wikipedia


Thanks, JSON exports make Wikipedia data much more approachable.


I wrote a similar thing with an emphasis on multiple analyzed fields, which makes it slower to index but much more flexible to query.

https://github.com/andrewvc/wikiparse/tree/java

That being said, when it comes to indexing Wikipedia, Elasticsearch can spread the indexing work well across multiple threads internally. Multithreading the reading/parsing isn't a huge win; doing the decompression in a separate thread is, however.
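
To make that split concrete, here is a minimal Akka Streams sketch (not the article's or wikiparse's code; it assumes Apache Commons Compress for bzip2, and the dump path is a placeholder). StreamConverters.fromInputStream does its blocking reads on Akka's blocking IO dispatcher, so decompression naturally ends up on a different thread than the downstream stages:

    import java.io.FileInputStream
    import org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream
    import akka.actor.ActorSystem
    import akka.stream.ActorMaterializer
    import akka.stream.scaladsl.{Framing, Sink, StreamConverters}
    import akka.util.ByteString

    object DecompressOffThread extends App {
      implicit val system = ActorSystem("wiki")
      implicit val materializer = ActorMaterializer()
      import system.dispatcher

      val dumpPath = "enwiki-dump.json.bz2" // placeholder path

      // The blocking reads (and hence the bzip2 decompression) run on Akka's
      // blocking-io dispatcher, separate from the stages below.
      StreamConverters.fromInputStream(() =>
          new BZip2CompressorInputStream(new FileInputStream(dumpPath)))
        .via(Framing.delimiter(ByteString("\n"), maximumFrameLength = 1 << 20, allowTruncation = true))
        .map(_.utf8String)               // stand-in for real parsing
        .runWith(Sink.foreach(_ => ()))  // stand-in for handing docs to ES
        .onComplete(_ => system.terminate())
    }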


Yes, ES uses multiple threads nicely. But as you move to 32 or 64 cores, in my experience a single-threaded client won't keep ES/Lucene busy enough.

With Solr, it's similar:

> Sometimes you need to index a bunch of documents really, really fast. [...] The solution is two-fold: batching and multi-threading

From: http://lucidworks.com/blog/high-throughput-indexing-in-solr/
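
For what it's worth, the client-side shape of "batching and multi-threading" is only a few lines with Akka Streams. A rough sketch, where indexBatch is a stand-in for whatever bulk call your client exposes (e.g. a POST to the ES _bulk endpoint, or SolrJ's add), and the batch size and parallelism are made-up numbers:

    import scala.concurrent.Future
    import akka.actor.ActorSystem
    import akka.stream.ActorMaterializer
    import akka.stream.scaladsl.{Sink, Source}

    object BatchedIndexing extends App {
      implicit val system = ActorSystem("bulk")
      implicit val materializer = ActorMaterializer()
      import system.dispatcher

      // Stand-in for a real bulk request; replace with your client's call.
      def indexBatch(docs: Seq[String]): Future[Unit] = Future { () }

      Source(1 to 1000000)
        .map(i => s"""{"id":$i}""")                      // fake documents
        .grouped(1000)                                   // batching
        .mapAsyncUnordered(parallelism = 8)(indexBatch)  // up to 8 bulk requests in flight
        .runWith(Sink.ignore)
        .onComplete(_ => system.terminate())
    }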


> Process the whole Wikidata in 7 minutes with your laptop

Wikidata is several times smaller than Freebase (which Google shut down in May), and it still won't fit in your laptop's RAM.


As you point out, Freebase is bigger than Wikidata: it is 22GB compressed (250GB uncompressed), while Wikidata is 5GB compressed (49GB uncompressed) [1].

That said, I believe the process described in the blog post does not load the whole Wikidata dump into memory, and it would work just the same for processing Freebase or even larger data dumps on your laptop.

From the post: "How Akka Streams can be used to process the Wikidata dump in parallel and using constant memory with just your laptop."

[1] https://developers.google.com/freebase/data and http://dumps.wikimedia.org/other/wikidata/
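
For anyone wondering how "constant memory" works mechanically: the dump is one big JSON array, but with one entity per line, so you can frame the byte stream on newlines and parse a single entity at a time instead of holding 49GB. A rough sketch, not the post's code (the file path is a placeholder, Jackson is just one possible parser, and it assumes an already-decompressed dump with the one-entity-per-line layout):

    import java.nio.file.Paths
    import akka.actor.ActorSystem
    import akka.stream.ActorMaterializer
    import akka.stream.scaladsl.{FileIO, Framing}
    import akka.util.ByteString
    import com.fasterxml.jackson.databind.ObjectMapper

    object CountEntities extends App {
      implicit val system = ActorSystem("wikidata")
      implicit val materializer = ActorMaterializer()
      import system.dispatcher

      val mapper = new ObjectMapper()

      FileIO.fromPath(Paths.get("wikidata-dump.json"))   // placeholder path
        .via(Framing.delimiter(ByteString("\n"), maximumFrameLength = 10 * 1024 * 1024, allowTruncation = true))
        .map(_.utf8String.stripSuffix(","))              // each entity sits on its own comma-terminated line
        .filterNot(line => line == "[" || line == "]")   // drop the enclosing array brackets
        .map(line => mapper.readTree(line))              // parse one entity at a time; memory stays flat
        .runFold(0L)((n, _) => n + 1)
        .foreach { n => println(s"$n entities"); system.terminate() }
    }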


What are your favorite large, publicly available datasets?


Biased reply (I'm a data scientist there): Common Crawl[1]. We build and maintain an open repository of web crawl data that can be accessed and analyzed by anyone completely free.

[1]: http://commoncrawl.org/



The Cancer Genome Atlas, Ensembl, 1000Genomes.


From their video: The presenter: "Why would you (the assistant) be interested in cars?" The assistant: "I'm the perfect chick to be into Maserati."

It's a bit disturbing to see an employee presenting her personal life, kids, interests, and what not. Good job, IntentHQ!

The video: https://www.intenthq.com/resources/interest-fingerprint/


I wouldn't say it was disturbing, but it was definitely cringeworthy. A lot of their blog feels that way. They've gone for an informal corporate approach, using puns[1] and memes[2] as headings. Even the video felt badly scripted: it was meant to sound like an informal pub conversation, but instead it came off as awkward and unprofessional.

I'm sure their products are of the highest quality, but their blog isn't a great advert in my opinion.

[1] http://engineering.intenthq.com/2015/06/for-those-about-to-c...

[2] http://engineering.intenthq.com/2015/06/wikidata-akka-stream...


I have to say I have mixed feelings about the video. On one hand, I understand there's a whole world of people out there, and I don't mind openness and honesty. Big thumbs up to her for being honest and cool. On the other hand, it cuts the other way when you're bragging about your awesome product that analyzes people's lives and sells that info to corporations.


You're assuming those details aren't made up ;)

The privacy thing didn't really bother me, because it's either fake data or she's consented to publishing real data about herself; either way it's a considered decision. My issue was just how awkwardly the presentation was delivered. Maybe that could have been avoided if they had used a fictional character like Homer Simpson? But then that would have its own issues.


Why is it disturbing to see someone who is open and not afraid of a bunch of kids and freaks on the Internet? :)


Could this example have been accomplished with awk and xargs just as fast, with same or less memory usage, in fewer lines of code?

Seems so to me after skimming the article, but maybe I missed an important advantage of using Akka Streams for this task?


Yes, the initial parts of the example could be accomplished with awk and xargs, but as the article goes on to demonstrate, even doing something like printing every nth element would be difficult.

I think the intent was for this to be more of a demonstrative example, and with a more complex, evolving, real-world processing pipeline, Akka Streams could be really useful.
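
Roughly, once the dump is already a Source of parsed entities, an "every nth element" stage plus a parallel processing stage is just a couple of composable lines. A sketch with made-up placeholders (entities and enrich stand in for the real parsing and processing steps):

    import scala.concurrent.Future
    import akka.NotUsed
    import akka.stream.scaladsl.{Flow, Sink, Source}

    object PipelineSketch {
      // Placeholders for the real parsed-entity source and per-entity work.
      val entities: Source[String, NotUsed] = Source(1 to 100000).map(i => s"entity-$i")
      def enrich(e: String): Future[String] = Future.successful(e.toUpperCase)

      // Roughly "every 1000th element" (the final partial group also emits its last element).
      val everyNth: Flow[String, String, NotUsed] = Flow[String].grouped(1000).map(_.last)

      val pipeline = entities
        .via(everyNth)
        .mapAsync(parallelism = 4)(enrich)  // up to 4 enrich calls in flight, order preserved
        .to(Sink.foreach(println))
      // pipeline.run() once an ActorSystem and materializer are in scope
    }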


Are streaming json parsers that rare?
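
They aren't rare. Jackson's streaming API, for instance, can walk the dump's top-level array one token at a time and materialize only one entity at once. A rough Scala sketch (placeholder path; assumes the dump is a single JSON array of entity objects):

    import java.io.File
    import com.fasterxml.jackson.core.JsonToken
    import com.fasterxml.jackson.databind.{JsonNode, ObjectMapper}

    object StreamingParse extends App {
      val mapper = new ObjectMapper()
      val parser = mapper.getFactory.createParser(new File("wikidata-dump.json")) // placeholder path

      require(parser.nextToken() == JsonToken.START_ARRAY) // the dump is one big array

      var count = 0L
      while (parser.nextToken() == JsonToken.START_OBJECT) {
        val entity: JsonNode = mapper.readTree(parser) // reads exactly one entity, then stops
        // hand `entity` to whatever processing you need here
        count += 1
      }
      parser.close()
      println(s"$count entities")
    }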



