
"The accuracy of PostgreSQL searches is not the best. In the examples that follow I’ve explained how to weight titles over text, but we found a few cases where typing the exact title of a document wouldn’t return that document as the first result."

Isn't this a concern as the main objective of search is to provide accurate results?



Sometimes, accurate enough is all you need. It's definitely a concern, but there are other issues to balance against it -- the infrastructure weight of adding new components, etc.


Text-search ranking is customisable[0], and results vary wildly based on the ranking behaviours selected and the weights assigned to different labels. It takes a bit of fine-tuning, and with the wrong parameters for your data set you can definitely get results that seem unintuitive.
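As a sketch of what that tuning looks like (assuming a hypothetical `docs` table with `title` and `body` columns -- the table and column names are made up for illustration):

```sql
-- Label title lexemes 'A' and body lexemes 'D' so they can be
-- weighted differently at ranking time.
ALTER TABLE docs ADD COLUMN tsv tsvector;
UPDATE docs SET tsv =
    setweight(to_tsvector('english', coalesce(title, '')), 'A') ||
    setweight(to_tsvector('english', coalesce(body,  '')), 'D');

-- ts_rank's optional weight array is ordered {D, C, B, A};
-- raising A relative to D favours title matches.
SELECT title,
       ts_rank('{0.1, 0.2, 0.4, 1.0}', tsv, query) AS rank
FROM docs, to_tsquery('english', 'postgres & search') AS query
WHERE tsv @@ query
ORDER BY rank DESC;
```

Whether an exact-title query lands first depends heavily on those four numbers and on the normalization flag you pass, which is exactly where "unintuitive" orderings tend to come from.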

This should be a concern to the author, but there's no reason to think the search ranking is not working as documented.

[0] http://www.postgresql.org/docs/9.4/static/textsearch-control...


>Isn't this a concern as the main objective of search is to provide accurate results?

If it provides 99% of the results and misses some documents because of some weird bug or encoding issue, then it could very well be good enough for their purposes. Heck, even 90% could be good depending on what they do (e.g. serving articles on an online site).

For other uses, like police or medical records, they'd obviously need 100% of the results.


I know "good enough" is probably not just a good idea at a startup; it's possibly mandatory, since there's only so much time and money. But as a user/consumer/customer/target demographic, I can't begin to describe how much I disdain knowing that something exists on a site but being unable to find it using search, particularly when I know the exact title. Reddit's search several years ago was quite bad and left a sour taste in my mouth.


I'm already cringing about people in this thread talking about "language detection" and "stemming" as if there are good, easy solutions to them.

Take your favorite language detector, like cld2. Apply it to some real-world language, like random posts on Twitter. Did it detect the languages correctly? Welp, there goes that idea.

(Tweets are too short, you say? Tough. Search queries are shorter. You probably aren't lucky enough for your domain's text to be complete articles from the Wall Street Journal, which is what the typical NLP algorithm was trained on.)

Stemming will always be difficult and subtle. It's useful but it isn't even linguistically well-defined, so you'll have to tweak it a lot. If stemming seems easy, you haven't looked at where it goes wrong for your use case.
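In PostgreSQL specifically, one way to see where stemming goes wrong for your vocabulary is `ts_lexize`, which runs a single token through one dictionary (the dictionary name `english_stem` is the built-in snowball default):

```sql
-- Spot-check what the stemmer actually produces for domain terms;
-- the outputs are often shorter and more collision-prone than expected.
SELECT ts_lexize('english_stem', 'organization');
SELECT ts_lexize('english_stem', 'operational');

-- Compare with the full parsing pipeline on a whole phrase:
SELECT to_tsvector('english', 'organizations operate organs');
```

Running a sample of your own corpus through this is usually more informative than any general claim about stemmer quality.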


PG text search is entirely accurate; the author probably just screwed something up, e.g. forgot to specify the regconfig ('english' vs. 'simple', for example).
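For anyone unfamiliar with the regconfig issue, a minimal illustration (assuming the built-in 'english' and 'simple' configurations; actual defaults depend on your `default_text_search_config`):

```sql
-- 'english' stems and strips stop words; 'simple' does neither.
SELECT to_tsvector('english', 'The cats sat');  -- 'cat':2 'sat':3
SELECT to_tsvector('simple',  'The cats sat');  -- 'cats':2 'sat':3 'the':1

-- Mixing configurations between index time and query time
-- silently misses matches:
SELECT to_tsvector('english', 'The cats sat')
       @@ to_tsquery('simple',  'cats');  -- false ('cats' was indexed as 'cat')
SELECT to_tsvector('english', 'The cats sat')
       @@ to_tsquery('english', 'cats');  -- true
```

If the index was built with one config and the search form queries with another (or with an unstated default), exact-title searches can miss in exactly the way the article describes.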



