artembugara's comments | Hacker News

It really makes sense, and the best part — customers love it. It’s the simplest form of pricing, and it’s easy to understand.

In many cases, though, you don’t know whether the outcome is correct; we just have evals for that.

Our product is a SOTA recall-first web search for complex queries. For example, let’s say your agent needs to find all instances of product launches in the past week.

“Classic” web search would return the top results, while ours returns a full dataset where each row is a unique product (with citations to web pages).

We charge a flat fee per record. So, if we find 100 records, you pay us for 100. If it’s 0, then it’s free.


I get sad when I read comments like these, because I feel like HN is the only forum left where real discussion between real people providing real thoughts is happening. I think that is changing, unfortunately. The em-dashes and the strange ticks immediately trigger my antibodies and devalue it, whether that is appropriate or not.


Do you mean it’s written by AI?

Or just my writing style?


Not the writing style, but the fact that the em-dashes and strange ticks make it indistinguishable from something AI-generated. At least take the time to replace them with something you can produce easily on a physical keyboard.

Edit:

Well, actually - this kind of writing style does feel quite AI-ish:

> It really makes sense, and the best part — customers love it


The em dashes didn't strike me as LLM because they had spaces on either side, something I don't typically see in LLM outputs as much. But the quote you highlighted is pretty much dead-on for LLM "speak" I must admit. In the end though, I think this is human written.


It might be a Windows vs. macOS/Linux thing, but regardless, it's becoming the kind of pattern I'm subconsciously learning to ignore/filter out, similar to banner blindness with ads/editorials.


Why does it produce different ticks and em-dashes?


Chrome on iPhone


We started doing quarterly RFCs at NewsCatcher, and it's been a big game-changer. We're entirely remote.

I got this idea from the Netflix founder's book "No Rules Rules" (highly recommend it).

Overall, I think the main idea is that context is what matters, and RFCs help you get your (mine, I'm the founder) vision into people's heads a bit more. As a result, people can be more autonomous and move faster.


Congrats on the HN Launch!

It's probably the best research agent that uses live search. You're using Firecrawl, I assume?

We're soon launching a similar tool (CatchALL by NewsCatcher) that does the same thing but at a much larger scale, because we already index and pre-process millions of pages daily (news, corporate, government files). We're seeing much better results than parallel.ai for queries like "find all new funding announcements for any kind of public transit in California State, US that took place in the past two weeks"

However, our tool will not perform live searches, so I think we're complementary.

I'd love to chat.


I like this approach better, TBH - more reliable and robust. It probably satisfies 80% of customer queries too, as most want to query against the same sources.


Oh, I totally see your point.

We’re optimising for large enterprises and government customers that we serve, not consumers.

Even the most motivated people, such as OSINT or KYC analysts, can only skim through tens, maybe hundreds of web pages. Our tool goes through 10,000+ pages per minute.

An LLM that has to open each web page to process the context isn’t much better than a human.

A perfect web search experience for an LLM would be to get just the answer, i.e. the valid tokens that can be fully loaded into context, with citations.

Many enterprises should leverage AI workflows, not AI agents.

Nice-to-have vs. must-have: existing AI implementations are failing because it’s hard to rely on their results; therefore, they’re used for nice-to-haves.

Most business departments know precisely what real-world events can impact their operations. Therefore, search is unnecessary; businesses would love to get notifications.

The best search is no search at all. We’re building monitors – a solution that transforms your CatchALL query into a real-time updating feed.


So your customers just want to use this for their own internal data, not external data from the web. Is that correct?


No no, they want to use it on external data; we don't do any internal data.

I'll give a few examples of how they use the tool.

Example 1 -- a real estate PE firm that invests in multi-family residential buildings. Let's say they operate in Texas and want to get notifications about many different events. For example, they need to know about any new public transport infrastructure that will make a specific area more accessible -> prices will go up.

There are hundreds of valid records each month. However, to derive those records, we usually have to sift through tens of thousands of hyper-local news articles.

Example 2 -- Logistics & Supply Chain at an F100: tracking all the 3rd-party providers, any kind of instability in the main regions, disruptions at air and marine ports, political discussions around regulation that might affect them, etc. There are like 20-50 events, and all of them are multilingual at a global scale.

Thousands of valid records each week, millions of web pages to derive them from.


Hey, would be happy to chat. Shoot us an email at team@webhound.ai and we can set up a time.


done


Disclaimer: probably dumb questions.

So, the 20B model.

Can someone explain to me what I would need in terms of resources (GPUs, I assume) if I want to run 20 concurrent processes, assuming I need 1k tokens/second throughput (on each, so 20 x 1k)?

Also, is this model better than or comparable to gpt-4.1-nano for information extraction, and would it be cheaper to self-host the 20B?


An A100 will probably do 2-4k tokens/second on a 20B model with batched inference.

Multiply the number of A100s as necessary.

Here, you don't really need the RAM. If you could accept fewer tokens/second, you could do it much more cheaply with consumer graphics cards.

Even with an A100, the batching sweet spot is not going to give you 1k tokens/process/second. Of course, you could go up to an H100...
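
For a rough sense of the numbers, here is a back-of-envelope sketch in Python; the per-GPU throughput figures are the assumption above, not benchmarks:

    # Back-of-envelope: GPUs needed for 20 streams at 1k tokens/s each,
    # assuming ~2-4k aggregate batched tokens/s per A100 (assumed, not measured).
    streams, per_stream = 20, 1_000
    total = streams * per_stream             # 20,000 tokens/s aggregate target
    low, high = 2_000, 4_000                 # assumed per-A100 batched throughput
    print(f"~{total // high} to {total // low} A100s for the aggregate rate")
    # -> roughly 5 to 10 A100s for throughput, but per-stream speed still
    #    won't reach 1k tokens/s, since each decode step for a single stream
    #    is bounded by memory bandwidth, not by how many GPUs you add.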


You can only batch if you have distinct chats in parallel.


> > if I want to run 20 concurrent processes, assuming I need 1k tokens/second throughput (on each)


> assuming I need 1k tokens/second throughput (on each, so 20 x 1k)

3.6B activated params at Q8 x 1000 t/s = 3.6 TB/s just for the activated model weights (there's also context). So pretty much straight to a B200 and the like. 1000 t/s per user/agent is way too fast; make it 300 t/s and you could get away with a 5090/RTX PRO 6000.
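
Written out as a quick sketch (same figures as above; KV-cache traffic is ignored, and the 3.6B active-parameter count is taken from this thread):

    # Bandwidth-bound estimate: each decode step reads every active weight once,
    # so weight traffic ~= active_params * bytes_per_weight * tokens_per_second.
    active_params = 3.6e9        # active params per token (figure from above)
    bytes_per_weight = 1         # Q8 ~ 1 byte per weight
    for tps in (1_000, 300):
        tb_per_s = active_params * bytes_per_weight * tps / 1e12
        print(f"{tps} tok/s -> ~{tb_per_s:.1f} TB/s of weight reads")
    # -> ~3.6 TB/s at 1000 tok/s (B200 territory) and ~1.1 TB/s at 300 tok/s,
    #    which is within reach of a 5090 / RTX PRO 6000 class card.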


gpt-oss:20b is ~14GB on disk [1], so it fits nicely within a 16GB VRAM card.

[1] https://ollama.com/library/gpt-oss


You also need VRAM for the context window; you might be able to run a model that is 14GB in parameters with a small (~8k maybe?) context window on a 16GB card.
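
As a rough sizing sketch (the layer/head numbers below are illustrative placeholders for a GQA model, not gpt-oss-20b's actual published config):

    # KV-cache size ~= 2 (K and V) * layers * kv_heads * head_dim * bytes * tokens
    n_layers, n_kv_heads, head_dim = 24, 8, 64   # assumed/illustrative values
    bytes_per_elem = 2                           # fp16/bf16 cache
    ctx_tokens = 8_192
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    print(per_token / 1024, "KiB per token")                    # 48.0 KiB here
    print(per_token * ctx_tokens / 2**30, "GiB at 8k context")  # ~0.38 GiB here
    # Multiply by concurrent sequences, add the ~14 GB of weights, and compare
    # against the 16 GB card to see whether the context budget fits.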


Thanks, this part is clear to me.

But I still need to understand the 20 x 1k token throughput part.

I assume it just might be too early to know the answer.


I legitimately cannot think of any hardware that will get you to that throughput over that many streams (I don't work in the server space, so there may be some new stuff I am unaware of).


Oh, I totally understand that I'd need multiple GPUs. I'd just like to know which GPU specifically, and how many.


I don't think you can get 1k tokens/sec on a single stream using any consumer-grade GPU with a 20B model. Maybe you could with an H100 or better, but I somewhat doubt it.

My 2x 3090 setup gets me ~6-10 streams at ~20-40 tokens/sec (generation) and ~700-1000 tokens/sec (input) with a 32B dense model.


(Answer for 1 inference stream.) It all depends on the context length you want to support, as the activation memory will dominate the requirements. For 4096 tokens you can get away with 24GB (or even 16GB), but if you want the full 131,072 tokens you are not going to get there with a 32GB consumer GPU like the 5090. You'll need to spring for at minimum an A6000 (48GB), or preferably an RTX 6000 Pro (96GB).

Also keep in mind this model uses 4-bit layers for the MoE parts. Unfortunately, native accelerated 4-bit support only started with Blackwell on NVIDIA, so your 3090/4090/A6000/A100s are not going to be fast. An RTX 5090 will be your best starting point in the traditional card space. Maybe the unified-memory mini PCs like the Spark systems or the Mac mini could be an alternative, but I don't know them well enough.


How do Macs compare to RTXs for this? I.e., what numbers can be expected from a Mac mini/Mac Studio with 64/128/256/512GB of unified memory?


Groq is offering 1k tokens per second for the 20B model.

You are unlikely to match Groq on off-the-shelf hardware, as far as I'm aware.



Oh, it sounds exactly like what we’d need for embedding multi-page PDFs from government websites.


An absolute legend.

I missed my chance to see Black Sabbath in 2015 or 2016 at Rock am Ring because the last day was cancelled.

I'm happy for what Ozzy did in his sixties and seventies, and what a way to go.

And let's not forget, the most likely reason he's been able to get this far, given his lifestyle in the '80s and '90s, is Sharon.


Indeed... The clip for Under the Graveyard sums it up well: https://www.youtube.com/watch?v=iuzyA5gDa4E


Congrats, @tndl

You guys rock! Big fan


What are some startups that help precisely with “feeding the LLM the right context” ?


Is that really a product? I think it should be solved through workflows and policies rather than handing it to a 3rd-party provider. But I might be wrong.

[1] https://jdsemrau.substack.com/p/memory-and-context


Not a startup, and it doesn't help that you still have to choose, but I paid 200 USD for RepoPrompt (a macOS app).

It's a very niche app, and I haven't used it much since buying it, but there's that: https://repoprompt.com/


Anthropic


Cursor?


Will, Jeff, I am a BIG Exa fan. Congrats on finally doing your HN Launch.

I think NewsCatcher (my YC startup) and Exa aren’t direct competitors, but we definitely share the same insight — SERP is not the right way to let LLMs interact with the web, because it’s literally optimized for humans, who can open 10 pages at most.

What we found is that LLMs can sift through 10k+ web pages if you pre-extract all the signals out of them.

But we took a bit of a different angle. Even though we have over 1.5 billion news stories alone in our index, we don’t have a solution to sift through them the way your Websets do (saw your impressive GPU cluster :))

So what we do instead is build bespoke pipelines for our customers (who are mostly large enterprises/F1000), fine-tuning LLMs on specific information extraction with very high accuracy.

Our insight: for many enterprises, the solution should be either a perfect fit or nothing. And that’s where they’re OK paying 10-100x for the last-mile effort.

P.S. Will, loved your comment on a podcast where you said Exa can be used to find a dating partner.


Thanks Artem! That makes sense to specialize for the biggest customers. Yes, a lot of problems in the world would be improved by better search, including dating.


"Search the web" is apparently using SERP.

It just breaks my head. We’ve built LLMs that can process millions of pages at a time, but what we give them is a search engine optimized for humans.

It’s like giving a humanoid robot access to a keyboard with a mouse to chat with another humanoid robot.

Disclaimer: I might be biased as we’re kind of building the fact search engine for LLMs.


No LLM can process millions of web pages. Maybe you're thinking of something else?


This is a problem I think about often. I’d be curious to know what kind of things you’ve learned / accomplished in that problem space so far.


What makes you think Claude is using a search engine optimized for humans?

