Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

full-text searching is contentious, it's a rabbit-hole of holy-wars.

I don't believe I'm going to summon cranks with the following:

There's more specialized tools for your usecase than postgres and you should be looking into "n-gram indexing".

Lucene based systems such as elasticsearch are quite popular there's also sphinx and xapian, also fairly widespread.

You need to read the documentation and configure them, they are flexible and need to be tuned to this usecase.

In the end, there is no "correct" way to do things. For instance, sometimes stemming words is the right way to go, but if you are say doing a medical system where two very different medicines could be spelled similar and stem to the same word, a mistake could lead to death, so no, stemming is very bad here.

Sometimes stop words is the way to go, but if people are looking for quotes, such as "to be or not to be" well now you have the empty string, splendid.

So yeah, configure configure configure.

This may bring out the cranks:

If you want to roll-your-own nosql systems like redis and mongo or couch seem to work really well (I've rolled my own in all 3 on separate occasions).

I guarantee there's advanced features of maria and postgres that aren't widely used and some people reading this will confidently claim they are superior but I assure you, that is a minority opinion. Most people go with the other options.

If you ever doubt it, ask the actual developers of postgres or maria on chat. They are extremely nice and way more honest about the limitations then their hardcore fans are. The databases are under constant development with advanced features and you'll learn a lot (really, they are both super chill dev communities, impressively so).

Perhaps your solution (as mine has been) is a hybrid. You can store the existence for instance, in one system and the offsets and chunks in another so you get a parallelizable asynchronous worker pipeline, it's impressively fast when you horizontally scale it. <0.5 sec for multiple terabytes of text (and yes, I'm talking nvme/hundreds of gb of RAM per node/>=10gb network). I've legitimately just done random queries to marvel at the speed

I really hope I save myself from the rock throwing.



> For instance, sometimes stemming words is the right way to go, but if you are say doing a medical system where two very different medicines could be spelled similar and stem to the same word...

FWIW lemmatization may be a good alternative to stemming. Stemming is algorithmic and can generate errors, as you point out; "caring" for example might naively be stemmed to "car". Lemmatization uses a dictionary of senses and their root words to avoid this. For common English, there's Princeton's Wordnet (https://wordnet.princeton.edu/). Lemmatizing technical niches, like medicine, would require an additional dictionary.


> there's also sphinx and xapian, also fairly widespread.

Sphinx is now Manticore, and as luck has it, a Manticore dev is in this thread, offering support: https://news.ycombinator.com/item?id=25890828


Thanks! I really appreciate the pointers. I already had planned to explore some of these and you've expanded and directed the search nicely.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: