On a related note: when I indexed the whole English Wikipedia last year, I was surprised that it was possible to have a JSON version of it indexed[1] and searchable within half an hour on my laptop.
That said, when indexing Wikipedia, Elasticsearch itself spreads the indexing work well across multiple threads internally. Multithreading the reading/parsing isn't a huge win; doing decompression in a separate thread, however, is.
Yes, ES uses multiple threads nicely. But as you move to 32 or 64 cores, in my experience, a single-threaded client won't keep ES/Lucene busy enough.
With Solr, it's similar:
> Sometimes you need to index a bunch of documents really, really fast. [...] The solution is two-fold: batching and multi-threading
[1] Using parallel bulk indexer for ES: https://github.com/miku/esbulk