It really makes sense, and the best part — customers love it. It’s the simplest form of pricing, and it’s easy to understand.
In many cases, though, you don’t know whether the outcome is correct or not, but we have evals for that.
Our product is a SOTA recall-first web search for complex queries. For example, let’s say your agent needs to find all instances of product launches in the past week.
“Classic” web search would return the top results, while ours returns a full dataset where each row is a unique product (with citations to web pages).
We charge a flat fee per record. So, if we found 100 records, you pay us for 100. If it’s 0, then it’s free.
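To make the per-record model concrete, here's a rough sketch of what a returned row and the resulting bill could look like. The field names and the fee are illustrative assumptions, not our actual schema or pricing.

    # Hypothetical shape of one result row and the per-record bill.
    # Field names and the fee are illustrative, not the real schema or pricing.
    records = [
        {
            "product": "Acme Widget 2.0",
            "launched_at": "2025-05-02",
            "citations": ["https://example.com/acme-widget-launch"],
        },
        # ...one row per unique product found
    ]

    FEE_PER_RECORD = 0.05  # assumed flat fee in USD

    def bill(found):
        # 100 records -> pay for 100; 0 records -> pay nothing
        return len(found) * FEE_PER_RECORD

    print(bill(records))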
I get sad when I read comments like these, because I feel like HN is the only forum left where real discussion between real people providing real thoughts is happening. I think that is changing, unfortunately. The em-dashes and the strange ticks immediately trigger my antibodies and devalue it, whether that is appropriate or not.
Not the writing style, but the fact that the em-dashes and strange ticks make it indistinguishable from something AI-generated. At least take the time to replace them with something you can produce easily on a physical keyboard.
Edit:
Well, actually - this kind of writing style does feel quite AI-ish:
> It really makes sense, and the best part — customers love it
The em dashes didn't strike me as LLM because they had spaces on either side, something I don't typically see in LLM outputs as much. But the quote you highlighted is pretty much dead-on for LLM "speak" I must admit. In the end though, I think this is human written.
It might be a Windows vs. MacOS/Linux thing, but regardless - it's becoming a similar kind of pattern that I'm subconsciously learning to ignore/filter out, similar to banner blindness and ads/editorials.
We started doing quarterly RFCs at NewsCatcher, and it was a big game-changer. We're entirely remote.
I got this idea from the Netflix founder's book "No Rules Rules" (highly recommend it).
Overall, I think the main idea is: context is what matters, and an RFC helps you get your (mine, I'm the founder) vision into people's heads a bit more. Therefore, people can be more autonomous and move faster.
It's probably the best research agent that uses live search. You're using Firecrawl, I assume?
We're soon launching a similar tool (CatchALL by NewsCatcher) that does the same thing but on a much larger scale, because we already index and pre-process millions of pages daily (news, corporate, government files). We're seeing much better results compared to parallel.ai for queries like "find all new funding announcements for any kind of public transit in California State, US that took place in the past two weeks".
However, our tool will not perform live searches, so I think we're complementary.
I like this approach better, TBH: more reliable and robust. It probably satisfies 80% of customer queries too, as most want to query against the same sources.
We’re optimising for large enterprises and government customers that we serve, not consumers.
Even the most motivated people, such as OSINT or KYC analysts, can only skim through tens, maybe hundreds of web pages. Our tool goes through 10,000+ pages per minute.
An LLM that has to open each web page to process the context isn’t much better than a human.
A perfect web search experience for an LLM would be to get just the answer, i.e., the valid tokens that can be fully loaded into context, with citations.
Many enterprises should leverage AI workflows, not AI agents.
Nice-to-have vs. must-have: existing AI implementations are failing because it's hard to rely on their results; therefore, they're used for nice-to-haves.
Most business departments know precisely what real-world events can impact their operations. Therefore, search is unnecessary; businesses would love to get notifications.
The best search is no search at all. We're building monitors: a solution that transforms your CatchALL query into a real-time updating feed.
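As a sketch of how a monitor could be expressed (the structure and field names below are assumptions for illustration, not an actual CatchALL API):

    # Illustrative sketch of a "monitor": a standing query that becomes a feed.
    # The structure and field names are assumptions, not an actual CatchALL API.
    monitor = {
        "query": "new funding announcements for public transit in California, US",
        "lookback": "14d",       # how far back the initial backfill goes
        "refresh_every": "15m",  # how often the standing query is re-evaluated
        "delivery": {"type": "webhook", "url": "https://example.com/hooks/transit"},
    }

    def on_new_records(records):
        # Each record arrives pre-extracted with citations, instead of raw pages.
        for r in records:
            print(r["event_type"], r["citations"])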
no no, they want to use it on external data, we do not do any internal data.
I'll give a few examples of how they use the tool.
Example 1 -- real estate PE that invests in multi-family residential buildings.
Let's say they operate in Texas and want to get notifications about many different events. For example, they need to know about any new public transport infrastructure that will make a specific area more accessible -> prices will go up.
There are hundreds of valid records each month. However, to derive those records, we usually have to sift through tens of thousands of hyper-local news articles.
Example 2 -- Logistics & Supply Chain at F100
Tracking all their 3rd-party providers, any kind of instability in the main regions, disruptions at air and marine ports, political discussions around regulation that might affect them, etc. There are like 20-50 event types, and all of them are multi-lingual at a global scale.
Thousands of valid records each week, and millions of web pages to derive them from.
Can someone explain to me what I would need in terms of resources (GPU, I assume) if I want to run 20 concurrent processes, assuming I need 1k tokens/second throughput on each (so 20 x 1k total)?
Also, is this model better than or comparable to gpt-4.1-nano for information extraction, and would it be cheaper to host the 20b myself?
> assuming I need 1k tokens/second throughput (on each, so 20 x 1k)
3.6B activated params at Q8 x 1000 t/s = 3.6TB/s just for activated model weights (there's also context). So pretty much straight to a B200 and the like. 1000 t/s per user/agent is way too fast; make it 300 t/s and you could get away with a 5090/RTX PRO 6000.
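For anyone who wants to sanity-check that arithmetic, here's a back-of-the-envelope sketch. It only counts activated weights read per generated token and ignores KV cache and other overheads:

    # Back-of-the-envelope memory-bandwidth estimate for a single decode stream.
    # Only counts activated weights read per token; ignores KV cache and overheads.
    def required_bandwidth_gb_s(activated_params_billions, bytes_per_param, tokens_per_s):
        # params are in billions, so the result comes out directly in GB/s
        return activated_params_billions * bytes_per_param * tokens_per_s

    # 3.6B activated params at Q8 (~1 byte/param), 1000 tokens/s per stream:
    print(required_bandwidth_gb_s(3.6, 1.0, 1000))  # 3600 GB/s, i.e. ~3.6 TB/s
    # The suggested 300 tokens/s target instead:
    print(required_bandwidth_gb_s(3.6, 1.0, 300))   # ~1080 GB/s, within reach of a 5090-class card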
You also need space in VRAM for what is required to support the context window; you might be able to do a model that is 14GB in parameters with a small (~8k maybe?) context window on a 16GB card.
I legitimately cannot think of any hardware that will get you to that throughput over that many streams (I don't work in the server space, so there may be some new stuff I am unaware of).
I don't think you can get 1k tokens/sec on a single stream using any consumer-grade GPU with a 20b model. Maybe you could with an H100 or better, but I somewhat doubt that.
My 2x 3090 setup will get me ~6-10 streams at ~20-40 tokens/sec (generation) and ~700-1000 tokens/sec (input) with a 32b dense model.
(Answer for a single inference stream.)
All depends on the context length you want to support, as the activation memory will dominate the requirements. For 4096 tokens you will get away with 24GB (or even 16GB), but if you want to go for the full 131072 tokens you are not going to get there with a 32GB consumer GPU like the 5090. You'll need to spring for at minimum an A6000 (48GB) or preferably an RTX 6000 Pro (96GB).
Also keep in mind this model uses 4-bit layers for the MoE parts. Unfortunately, native accelerated 4-bit support only started with Blackwell on NVIDIA, so your 3090/4090/A6000/A100s are not going to be fast. An RTX 5090 will be your best starting point in the traditional card space. Maybe the unified-memory mini-PCs like the Spark systems or the Mac mini could be an alternative, but I don't know them well enough.
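To get a feel for why long contexts blow past consumer VRAM, here's a rough per-stream KV-cache estimate. The layer/head numbers below are illustrative placeholders, not the exact architecture of this model:

    # Rough KV-cache size per stream. The architecture numbers are illustrative
    # placeholders, not the exact specs of this model.
    def kv_cache_gib(context_len, n_layers=24, n_kv_heads=8, head_dim=64,
                     bytes_per_elem=2):  # 2 bytes = fp16/bf16 cache
        # 2x for keys and values
        return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem / 2**30

    print(kv_cache_gib(4_096))    # a fraction of a GiB, fits next to the weights on 24GB
    print(kv_cache_gib(131_072))  # several GiB, and it scales linearly with concurrent streams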
Is that really a product? I think it should be solved through workflows and policies rather than handing it to a 3rd-party provider. But I might be wrong.
Will, Jeff, I am a BIG Exa fan. Congrats on finally doing your HN Launch.
I think NewsCatcher (my YC startup) and Exa aren’t direct competitors, but we definitely share the same insight: SERPs are not the right way to let LLMs interact with the web, because they’re literally optimized for humans who can open 10 pages at most.
What we found is that LLMs can sift through 10k+ web pages if you pre-extract all the signals out of them.
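A minimal sketch of what that pre-extraction means in practice; the keyword check below is a stand-in for a fine-tuned extraction model, and the fields are illustrative, not our production pipeline:

    # Minimal sketch of pre-extraction: turn a raw page into a compact, typed row
    # so the LLM reads thousands of small records instead of thousands of full pages.
    # The keyword check is a stand-in for a fine-tuned extraction model.
    def extract_signals(page_text: str, url: str):
        if "funding" not in page_text.lower():
            return None  # irrelevant page contributes zero tokens downstream
        return {
            "event_type": "funding_announcement",
            "snippet": page_text[:200],
            "citation": url,
        }

    pages = [
        ("https://example.com/a", "Transit startup raises $20M in a new funding round."),
        ("https://example.com/b", "Local bakery opens a second location."),
    ]
    records = [r for url, text in pages if (r := extract_signals(text, url)) is not None]
    print(records)  # only the funding page survives, as a small structured row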
But we took a bit of a different angle. Even though we have over 1.5 billion news stories alone in our index, we don’t have a solution to sift through them the way your Websets do (saw your impressive GPU cluster :))
So what we do instead is build bespoke pipelines for our customers (mostly large enterprises/F1000): we fine-tune LLMs for specific information extraction with very high accuracy.
Our insight: for many enterprises the solution should be either a perfect fit or nothing. And that’s where they’re OK paying 10-100x for the last-mile effort.
P.S. Will, loved your comment on a podcast where you said Exa can be used to find a dating partner.
Thanks Artem! That makes sense to specialize for the biggest customers. Yes, a lot of problems in the world would be improved by better search, including dating.
It just breaks my brain. We’ve built LLMs that can process millions of pages at a time, but what we give them is a search engine that is optimized for humans.
It’s like giving a humanoid robot a keyboard and a mouse to chat with another humanoid robot.
Disclaimer: I might be biased as we’re kind of building the fact search engine for LLMs.