I'm a tedious broken record about this (among many other things), but if you haven't read this Richard Cook piece, I strongly recommend you stop reading this postmortem and go read Cook's piece first. It won't take you long. It's the single best piece of writing about this topic I have ever read, and I think it's the piece of technical writing that has done the most to change my thinking:
You can literally check off the things from Cook's piece that apply directly here. Also: when I wrote this comment, most of the thread was about root-causing the DNS thing that happened, which I don't think is the big story behind this outage. (Cook rejects the whole idea of a "root cause", and I'm pretty sure he's dead on right about why.)
Even better than earlyoom is systemd-oomd[0] or oomd[1].
systemd-oomd and oomd use the kernel's PSI[2] information which makes them more efficient and responsive, while earlyoom is just polling.
earlyoom keeps getting suggested, even though we have PSI now, just because people got used to using and recommending it back before the kernel had cgroups v2.
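For anyone who hasn't looked at PSI: it's just a couple of files under /proc/pressure that report how long tasks have been stalled waiting on memory. Here's a rough Python sketch of the signal, not how systemd-oomd or oomd are actually implemented (PSI also supports triggers, so a daemon can be woken by the kernel instead of polling):

    # Rough sketch only: read the kernel's memory pressure-stall info (PSI).
    # This shows the data a PSI-based killer reacts to, nothing more.
    def read_memory_pressure(path="/proc/pressure/memory"):
        pressure = {}
        with open(path) as f:
            # lines look like: "some avg10=0.31 avg60=0.12 avg300=0.04 total=123456"
            for line in f:
                kind, *fields = line.split()
                pressure[kind] = dict(field.split("=") for field in fields)
        return pressure

    p = read_memory_pressure()
    # "full" means all non-idle tasks were stalled on memory during the window.
    if float(p["full"]["avg10"]) > 20.0:   # threshold picked arbitrarily for the example
        print("sustained memory pressure; an oomd-style daemon would start acting here")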
(Disclaimer: I am the CEO of LlamaIndex, which includes LlamaParse)
Nice article! We're actively benchmarking Gemini 2.0 right now and if the results are as good as implied by this article, heck we'll adapt and improve upon it. Our goal (and in fact the reason our parser works so well) is to always use and stay on top of the latest SOTA models and tech :) - we blend LLM/VLM tech with best-in-class heuristic techniques.
Some quick notes:
1. I'm glad that LlamaParse is mentioned in the article, but it's not mentioned in the performance benchmarks. I'm pretty confident that our most accurate modes would be at the top of that benchmark table - our stuff is pretty good.
2. There's a long tail of issues beyond just tables - this includes fonts, headers/footers, the ability to recognize charts/images/form fields, and, as other posters said, the ability to have fine-grained bounding boxes on the source elements. We've optimized our parser to tackle all of these cases, and we need proper benchmarks for that.
3. DIY'ing your own pipeline to run a VLM at scale to parse docs is surprisingly challenging. You need to orchestrate a robust system that can screenshot a bunch of pages at the right resolution (which can be quite slow), tune the prompts, and make sure you're obeying rate limits + can retry on failure.
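To make point 3 concrete, here's roughly the shape of the loop you end up writing; a minimal sketch, assuming pdf2image/poppler for rendering, with call_vlm() as a placeholder for whatever model and prompt you actually use:

    # Hedged sketch of the orchestration described above: render pages, call a
    # VLM per page, respect a rate limit, retry on failure.
    import time
    from pdf2image import convert_from_path  # example choice; needs poppler installed

    MAX_RETRIES = 3
    REQUESTS_PER_MINUTE = 30

    def call_vlm(page_image):
        """Placeholder: send the page image plus your prompt to a VLM, return markdown."""
        raise NotImplementedError

    def parse_pdf(path, dpi=200):  # resolution matters a lot for small table text
        results = []
        for i, page in enumerate(convert_from_path(path, dpi=dpi)):
            for attempt in range(MAX_RETRIES):
                try:
                    results.append(call_vlm(page))
                    break
                except Exception:
                    time.sleep(2 ** attempt)          # crude exponential backoff
            else:
                results.append(f"<!-- page {i} failed after {MAX_RETRIES} attempts -->")
            time.sleep(60 / REQUESTS_PER_MINUTE)      # crude client-side rate limiting
        return results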
Marker (https://www.github.com/VikParuchuri/marker) works kind of like this. It uses a layout model to identify blocks and processes each one separately. The internal format is a tree of blocks, which have arbitrary fields, but can all render to html. It can write out to json, html, or markdown.
I integrated Gemini recently to improve accuracy on certain blocks, like tables (get the initial text, then pass it to Gemini to refine). Marker alone works about as well as Gemini alone, but together they benchmark much better.
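These aren't Marker's actual classes (its internals are more involved), but the "tree of blocks that can render themselves, with an LLM second pass on the weak blocks" idea looks roughly like this:

    # Not Marker's actual internals, just the shape of the idea; llm_refine is
    # a placeholder for whatever refinement call you use.
    from dataclasses import dataclass, field

    @dataclass
    class Block:
        kind: str                      # "page", "text", "table", ...
        text: str = ""
        children: list = field(default_factory=list)

        def render_markdown(self):
            if self.children:
                return "\n\n".join(c.render_markdown() for c in self.children)
            return self.text

    def refine_tables(block, llm_refine):
        """Hand only the hard blocks (e.g. tables) to the LLM, keep the rest as-is."""
        if block.kind == "table":
            block.text = llm_refine(block.text)
        for child in block.children:
            refine_tables(child, llm_refine)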
In my open source tool http://docrouter.ai I run both OCR and LLM/Gemini, using litellm to support multiple LLMs. The user can configure extraction schemas & prompts, and use tags to select which prompt/LLM combination runs on which uploaded PDF.
LLM extractions are then searched for in the OCR output, and if matched, the bounding box from the OCR output is displayed.
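The matching step is conceptually simple; a toy sketch of the idea (not the actual docrouter code), assuming the OCR engine returns words with boxes:

    # Toy version of the "find the LLM's extraction in the OCR output" idea,
    # assuming OCR words look like {"text": "ACME", "bbox": (x0, y0, x1, y1)}.
    def find_bbox(extracted_value, ocr_words):
        tokens = extracted_value.lower().split()
        n = len(tokens)
        for i in range(len(ocr_words) - n + 1):
            window = ocr_words[i:i + n]
            if [w["text"].lower() for w in window] == tokens:
                xs0, ys0, xs1, ys1 = zip(*(w["bbox"] for w in window))
                return (min(xs0), min(ys0), max(xs1), max(ys1))   # union of the word boxes
        return None   # the LLM paraphrased, so there is nothing to highlight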
Disclaimer: I'm the founder of Tensorlake; we built a document parsing API for developers.
This is exactly the reason why computer vision approaches to parsing PDFs work so well in the real world. Relying on metadata in files just doesn't scale across different sources of PDFs.
We convert PDFs to images, run a layout understanding model on them first, then apply specialized models such as text recognition and table recognition to the detected regions, and stitch the results back together to get acceptable results for domains where accuracy is table stakes.
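A minimal sketch of that pipeline shape, with the layout/text/table models left as placeholders (pdf2image for rasterization is just an example choice):

    # Sketch only: layout_model, text_model and table_model stand in for
    # whatever models you actually run.
    from pdf2image import convert_from_path

    def parse(path, layout_model, text_model, table_model, dpi=200):
        document = []
        for page in convert_from_path(path, dpi=dpi):
            regions = layout_model(page)              # -> [(kind, (x0, y0, x1, y1)), ...]
            blocks = []
            for kind, bbox in sorted(regions, key=lambda r: (r[1][1], r[1][0])):  # rough reading order
                crop = page.crop(bbox)
                model = table_model if kind == "table" else text_model
                blocks.append(model(crop))            # table -> markdown/html, text -> plain text
            document.append("\n\n".join(blocks))      # stitch the page back together
        return "\n\n".join(document)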
In my experience, it's more likely that the problem was the approach of the folks at your company who made your controls.
SOC2 (and a bunch of similar regimes) basically boil down to "have you documented enough of your company's approach to things that would be damaging to business continuity, and can you demonstrate with evidence to auditors with low-to-medium technical expertise that you are doing what you've said you'd do". Some compliance regimes and some auditors care to differing degrees about whether you can demonstrate that what you've said you'd do is actually a viable and complete way to accomplish the goal you're addressing.
So the good path is that the compliance regime has some baseline expectation like "Audit logs exist for privileged access", and whoever at your company is writing the controls writes "All the logs get sent to our SIEM, and the SIEM tracks what time it received the logs, and the SIEM is only administered by the SIEM administration team" and makes a nice diagram and once a year they show somebody that logs make it to the SIEM.
One of the bad paths is that whoever is writing the controls writes "We have a custom set of k8s helm charts which coordinate using Raft consensus to capture and replicate log data". This gets you to the bad path where now you've got to prove to several non-technical people how all that works.
Another bad path is that whoever writes the control says "well shit, I guess technically if Jimbo on the IT team went nuts, he could push a malicious update to the SIEM and then log in and delete all the data", and so they invent some Rube Goldberg machine to make that not possible, making the infrastructure insanely more complex when they could have just said "Only the SIEM admins can admin the SIEM" and leaned on the fact that auditors expect management to make risk assessments.
The other bad path is that whoever is writing the controls doesn't realize they have agency in the matter, and so they just ask the auditors what the controls should be, and the auditors hand them some boilerplate about how all the servers in the server farm should run NTP and they should uninstall telnet and make sure that their LAMP stack is patched and whatever else, because the auditors are not generally highly technical. And the control author just runs with that and you end up with a control that was just "whatever junk the auditors have amalgamated from past audits" instead of being driven by your company's stack or needs.
I have an almost exhaustive list [1] of browser-based text-to-diagram tools. Some specialised tools (like https://sequencediagram.org/) are so much better at what they do than generic ones like Mermaid.
Our mission at Surge is to build the human infrastructure behind the next wave of AI and LLMs. We’re building a data platform that powers AI teams at OpenAI, Anthropic, Meta, Google, and more. Reinforcement Learning with Human Feedback is the critical technique behind the new generation of AI assistants, and that human feedback comes from us. Our product has been a game-changer for the top AI teams in the world. Here are some examples of our past work:
You’d be joining a small, rapidly growing team of former engineering and ML leaders from Google, Meta, Twitter, and Airbnb. We work in small groups, ship quickly, and value autonomy and ownership. No previous AI experience is required; if you have the engineering skills, you can learn what you need on the job.
We're looking for engineers with a few years of experience; you can work out of our offices in SF or NYC, or remotely. Please reach out to us at careers@surgehq.ai with a resume and 2-3 sentences describing your interest. Excited to hear from you!
Prophet Town | Full-Stack Engineer + PM + DevOps/Infra | USA-ONLY REMOTE | Full-time | $240K-$370K annual total comp | English fluency required
I’m the founder, trying to do “enlightened business.” We are a small, fully-remote, sf-bay-area-based, boutique indie tech agency. Our leadership staff are all ex-Fortune 100; everybody codes. Notable recent projects: voltagepark.com and a slackbot for Anduril’s employees.
Unlike earlier HN Who's Hiring posts, we have specific existing clients in mind: total comp is not flexible hourly billing, but a full-time salary with client-tied equity compensation. We are currently filling “Tier 3” (5-10 y/o/e, $240K-$320K annual comp) and "Tier 4" (7-20 y/o/e, $300K-$370K annual total comp) roles:
Engineers: Full stack polyglot but JS/TS heavy (SQL/NextJS/Node/React/Remix), AWS deployments.
Infra+Devops: Kubernetes/Docker, Terraform, AWS service set, common CI/CD options.
PMs: JIRA/Trello, stakeholder wrangling, proof of winning engineer trust and surviving big-org politics.
We’re a worker-first operation. Applicants must meet a high bar; in return, I pledge my personal commitment to finding you interesting work and getting you good pay.
Contact James: hn-hiring@ptown.tech. We are a small shop and can get swamped; nonetheless if you send us something by Friday Oct 4, you will hear something by Saturday Oct 5. Please make sure to include a resume in pdf format, and clarify which of the three positions you are interested in.
UPDATE: we took a snapshot of all responses we had received by 10 pm Pacific on Oct 4, and have sent out an initial email to all applicants. There are quite a lot of you :). Additional applicants may still respond, but we will prioritize first responding to those who had already submitted by our snapshot time.
The fundamentals of software sales haven't changed much since this. While B2C SaaS is different, the B2B platform world is still much as described in this book, and more importantly, the buyers are still the people who were buying when this book was published.
While how you sell today arguably should have changed, many of the enterprise procurement processes that were being set up when this was published are still the same. That makes this an excellent foundation for understanding how to change it up.
That said, you said building an agency ... so do you mean selling software, or selling the ability to deliver solutions that a company can't get off the shelf?
The bigger problem might be using agents in the first place.
We did some testing with agents for content generation (e.g. "authoring" agent, "researcher" agent, "editor" agent) and found that it was easier to just write it as 3 sequential prompts with an explicit control loop.
It's easier to debug, monitor, and control the output flow this way.
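For flavor, the whole "researcher/author/editor" flow collapses into something like this; a minimal sketch, with complete_fn standing in for whatever LLM call you already have:

    # Three sequential prompts plus an explicit control loop instead of an
    # agent framework.
    def generate_article(topic, complete_fn, max_revisions=2):
        research = complete_fn(f"Research key facts and sources about: {topic}")
        draft = complete_fn(f"Write an article about {topic} using these notes:\n{research}")
        for _ in range(max_revisions):                # explicit loop: easy to log, cap, and debug
            review = complete_fn(f"Critique this draft, or reply APPROVED if it is good:\n{draft}")
            if "APPROVED" in review:
                break
            draft = complete_fn(
                f"Revise the draft to address this feedback:\n{review}\n\nDraft:\n{draft}"
            )
        return draft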
But we still use Semantic Kernel[0] because the lowest level abstractions that it provides are still very useful in reducing the code that we have to roll ourselves and also makes some parts of the API very flexible. These are things we'd end up writing ourselves anyways so why not just use the framework primitives instead?
Thanks for sharing, I like the approach and it makes a lot of sense for the problem space. Especially using existing products vs building/hosting your own.
I was however tripped up by this sentence close to the beginning:
> we encountered a significant challenge with RAG: relying solely on vector search (even using both dense and sparse vectors) doesn’t always deliver satisfactory results for certain queries.
Not to be overly pedantic, but that's a problem with vector similarity, not RAG as a concept.
Although the author is clearly aware of that - I have had numerous conversations in the past few months alone with people essentially saying "RAG doesn't work because I use pg_vector (or whatever) and it never finds what I'm looking for", not realizing 1) it's not the only way to do RAG, and 2) there is often a fair difference between the document embeddings and the vectorized query, and with awareness of why that is you can figure out how to fix it.
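As a concrete example of point 2: one common fix is to not embed the raw user query at all, but to rewrite it into something that looks like the documents first. A sketch of the shape of the idea, where embed(), complete() and index.search() are placeholders for your own stack:

    def retrieve(user_query, index, embed, complete, k=5):
        # Ask the LLM to restate the query in the vocabulary of the corpus
        # (a HyDE-style hypothetical answer works too).
        rewritten = complete(
            "Rewrite this question as a short passage that would appear in a "
            f"document answering it:\n{user_query}"
        )
        return index.search(embed(rewritten), k=k)   # then rerank and/or combine with keyword search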
The transformer is hard to describe with analogies, and to be fair there is no good explanation of why it works, so it may be better to just present the mechanism, "leaving the interpretation to the viewer". Also, it's simpler to describe dot products as vectors projecting onto one another.
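The mechanism really is short enough to just show. A minimal single-head scaled dot-product attention in numpy (no batching, masking, or learned projections):

    import numpy as np

    def attention(Q, K, V):
        # Each entry of Q @ K.T is a dot product: how strongly query i aligns with key j.
        scores = Q @ K.T / np.sqrt(K.shape[-1])
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the keys
        return weights @ V                               # weighted average of the values

    rng = np.random.default_rng(0)
    Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))
    print(attention(Q, K, V).shape)   # (4, 8)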
Excellent point on "Death of a Pig" by E.B. White. It's the perfect example of a timeless essay, without a "big scientific idea".
For others unaware of it: that essay was written in 1948 [1]; go read it in full. It starts like this:
"I spent several days and nights in mid-September with an ailing pig and I feel driven to account for this stretch of time, more particularly since the pig died at last, and I lived, and things might easily have gone the other way round and none left to do the accounting."
His list begins with The 10 Day MBA, which is an excellent book I recommend highly. It gives a brief overview covering 90+% of what you need to know. The other books on his list, many of which I have read and enjoyed myself, “merely” give greater detail on the material covered by T10DMBA (except macroeconomics, which is rarely MBA-relevant anyway).
The book is humorous, but quite serious about the material.
Nice post, OP! I was super impressed with Apple's Vision framework. I used it on a personal project involving OCRing tens of thousands of spreadsheet screenshots and ingesting them into a postgres database. I tried other CPU-based OCR methods (since macOS and Nvidia still don't play nice together), such as Tesseract, but found the output to be incorrect too often. The Vision framework not only had the highest quality output I had seen, it also used the least amount of compute. It was fairly unstable, but I can chalk that up to user error w/ my implementation.
I used a combination of RHetTbull's vision.py (for the actual implementation) [1] + ocrmac (for experimentation) [2] and was pleasantly surprised by the performance on my i7 6700k hackintosh.
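For anyone curious what that looks like in practice, the ocrmac path is only a few lines. I'm going from memory on the exact API, so treat this as a sketch and check the README:

    from ocrmac import ocrmac

    annotations = ocrmac.OCR("screenshot.png").recognize()
    for text, confidence, bbox in annotations:   # bbox is in normalized coordinates
        if confidence > 0.5:
            print(text)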
I wouldn't call myself a programmer, but I can generally troubleshoot anything if given enough time - and it did cost time.
Does "CTO" mean you are the tech lead of a small (single team) engineering organization? Then everything written for staff engineers applies. E.g I've heard good things about "Staff engineer's path" by Tanya Reilly.
Does "CTO" mean you are leading an org that is too large to be hands-on with tech, and need to build an effective structure and culture? Then I second the recommendation for "an elegant puzzle" by Will Larson.
Or does "CTO" mean that you switched from being an engineer to managing a team of engineers? Then everything for new managers applies, for starters I'd recommend "Becoming an effective software engineering manager" by James Stanier, or "Engineering management for the rest of us" by Sarah Drasner.
For some good general material, I'd also recommend the resources that Gergely Orosz makes available for subscribers to his "pragmatic engineer" newsletter. Those are templates for the kind of documents and processes you will most likely need - if you're new to the role, you will not go too wrong by using them, and if you want to create your own they are excellent starting points.
IMO, this announcement is far less significant than people make it out to be. The feature has been available as a private beta for a good few months, and as a public beta (with a waitlist) for the last few weeks. Most of the blind people I know (including myself) already have access and are pretty familiar with it by now.
I don't think this will replace human volunteers for now, but it's definitely a tool that can augment them. I've used the volunteer side of Be My AI quite a few times, but I only resort to that solution when I have no other option. Bothering a random human multiple times a day with my problems really doesn't feel like something I want to do. There are situations when you either don't need 100% certainty or know roughly what to expect and can detect hallucinations yourself. For example, when you have a few boxes that look exactly the same and you know exactly what they contain but not which box is which, Be My AI is a good solution. If it answers your question, that's great, if it hallucinates, you know that your box can only be one of a few things, so you'll probably catch that. Another interesting use case is random pictures shared to a group or Slack channel, it's good enough to let you distinguish between funny memes and screenshots of important announcements that merit further human attention, and perhaps a request for alt text.
This isn't a perfect tool for sure, but it's definitely pretty helpful if you know how to use it right. All these anti-AI sentiments are really unwarranted in this case IMO.
The public Human Genome Project used a group of people, but most of the sequence library was derived from a single individual in Buffalo, NY. The Celera project also used a group of people, but it was mostly Venter's genome.
I believe more recent sequencing projects have used a wider pool of individuals. I think some projects pool all the individuals and sequence them together, while others sequence each individual separately. This isn't really much of a problem, since the large-scale structure is highly similar across all humans and we have developed sophisticated approaches to model the variation between individuals; see https://www.biomedcentral.com/collections/graphgenomes for an explanation of the "graph structure" used to represent alternatives in the reference, which can cover everything from single-nucleobase differences to more complex variants such as large deletions, rearrangements, and even inversions.
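If the "graph structure" phrasing is unclear: the idea is that the reference stops being one linear string and becomes a graph where a variant site is just an extra path. A toy illustration (real pangenome tools and formats like vg/GFA are far richer than this):

    nodes = {1: "ACGT", 2: "A", 3: "G", 4: "TTCA"}   # nodes 2 and 3 are alternative alleles
    edges = {1: [2, 3], 2: [4], 3: [4], 4: []}

    def haplotypes(node, prefix=""):
        """Spell out every sequence the graph can represent."""
        prefix += nodes[node]
        if not edges[node]:
            yield prefix
        for nxt in edges[node]:
            yield from haplotypes(nxt, prefix)

    print(list(haplotypes(1)))   # ['ACGTATTCA', 'ACGTGTTCA']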
Perry Mehrling wrote “The New Lombard Street” covering all the new mechanics of international financial markets. It’s just as fascinating. He taught a course on the material called “Economics of Money and Banking.” I never felt I grokked the interplay between the Fed, USG, and Wall Street until I studied that course.
I still can't stand Pydantic's API and its approach to non-documentation. I respect the tremendous amount of hard work that goes into it, but fundamentally I don't like the developer experience and I don't think I'll ever feel otherwise. I use it because my coworkers like it and I've learned its advanced features because I had to in order to get things done, not because I like it.
I would love to see a FastAPI alternative still using Starlette internally, but using Attrs + Marshmallow + Cattrs + Apispec instead of Pydantic. It would be a little less "fast" to write a working prototype, but I'd feel much more comfortable working with those APIs, as well as much more comfortable that my dependencies are well-supported and stable.
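For what it's worth, the glue itself is small. A rough sketch of that stack on Starlette, using attrs for the model and cattrs for conversion (no apispec/OpenAPI wiring here, which is a big part of what FastAPI gives you for free):

    import attrs, cattrs
    from starlette.applications import Starlette
    from starlette.requests import Request
    from starlette.responses import JSONResponse
    from starlette.routing import Route

    @attrs.define
    class Item:
        name: str
        price: float
        tags: list[str] = attrs.field(factory=list)

    async def create_item(request: Request) -> JSONResponse:
        try:
            item = cattrs.structure(await request.json(), Item)   # structure + type conversion
        except Exception as exc:
            return JSONResponse({"error": str(exc)}, status_code=422)
        return JSONResponse(cattrs.unstructure(item))

    app = Starlette(routes=[Route("/items", create_item, methods=["POST"])])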
The problem of course is not that gluing those things together is hard. The problem is that now someone has put untold hundreds of person-hours into FastAPI, and replicating that level of care, polish, bugfixes, feature requests, etc. is difficult without putting in those hundreds of person-hours yourself.
> It is really not so repulsive to see the poor asking for money as to see the rich asking for more money. And advertisement is the rich asking for more money. A man would be annoyed if he found himself in a mob of millionaires, all holding out their silk hats for a penny; or all shouting with one voice, “Give me money.” Yet advertisement does really assault the eye very much as such a shout would assault the ear. “Budge’s Boots are the Best” simply means “Give me money”; “Use Seraphic Soap” simply means “Give me money.” It is a complete mistake to suppose that common people make our towns commonplace, with unsightly things like advertisements. Most of those whose wares are thus placarded everywhere are very wealthy gentlemen with coronets and country seats, men who are probably very particular about the artistic adornment of their own homes. They disfigure their towns in order to decorate their houses.
(OP here) - yeah i know, but i also know how AI twitter works so I put both the headline and the caveats. i always hope to elevate the level of discourse by raising the relevant facts to those at my level/a little bit behind me in terms of understanding. think there's always a fine balance between getting deep/technical/precise and getting attention, and you have to thread the needle in a way that feels authentic to you to do this "job"
ultimately my goal is to Learn in Public and demonstrate to experts that spending their time teaching/sharing with me is a good use of time because i will augment/amplify/simplify their message.
(pls give the full podcast a listen/read/watch, George went deep on tinygrad/tinybox/tinycorp and there's lots there he IS the authority on, and people are overly fixated on the GPT4 rumor https://www.latent.space/p/geohot#details )
RedReader is the best Reddit client I've found on Android. Apple's Mail app is really nice, since it lets me navigate within a conversation to the next or previous message (wish Gmail did that). Mona is the best Mastodon client on iOS; I can favorite or reblog just by swiping left or right with three fingers, which isn't even possible on Android because TalkBack's scrolling commands work on gesture position, not on TalkBack focus position. Feeder is the best RSS reader on Android. Lire is great on iOS, although I don't like having to clean up the article list after marking a few as read. Tusky on Android is a good app that uses accessibility actions. I love how in Google's Clock app I can dismiss an alarm by using a volume button while the device's screen is off (once you hit the power button, you have to find and double tap the dismiss button and hope the alarm sound is quieter than your TalkBack voice). I don't know how it is now, but when I was still using Instacart, if I was in a list of food items and wanted to add one to my cart, I could just swipe up and it'd be added. If I wanted to remove an item, I'd just swipe down and it'd be removed. That was refreshingly simple.
There are a lot of games on iOS that are accessible, like DiceWorld, which is also pretty accessible on Android but doesn't have the nice-to-have features of iOS, like using the Magic Tap (double tap with two fingers) from anywhere on the screen to roll the dice. Android can't do that because its two finger double tap only works as the play/pause button on a headset, just playing/pausing media instead of a more general "make something happen" command in apps.
Some more action-oriented games, like Mortal Kombat, are becoming more accessible. On iOS, apps can define "direct touch interaction" areas of the screen, which make VoiceOver ignore that part of the screen so that a tap is passed directly to the app rather than going through VoiceOver. Using this, one can attack and block instantly during the action parts of the game, with the game sending announcements through VoiceOver, like "swipe up" or "you win". Then the screen changes, the direct touch area is deactivated, and VoiceOver control is reasserted, so that the player can navigate the interface. It shows really amazing promise for mobile gaming for the blind in the future. Android doesn't have anything like the direct touch area, unless the app wants to declare itself an accessibility service, which I believe is how the TalkBack Braille keyboard does it, going by the code on GitHub. Of course, that code is like 6 months behind the actual release and definitely behind any betas they've released, but it's still useful for seeing how TalkBack does things.
Of course, each user has differing knowledge of how their screen reader works. I find that many more Android users know more about TalkBack, since it has a built-in tutorial that opens when TalkBack is first turned on. VoiceOver on iOS does not have a tutorial that teaches everything one can do, all the commands and such, so users are usually unaware of things like image recognition, Braille Screen Input (typing in Braille on the phone screen), or all the commands possible (especially the rotor and using actions within apps).
There's a lot of good research and writing on this topic. This paper, in particular, has been really helpful for my cause: https://dl.acm.org/doi/pdf/10.1145/3593856.3595909
It has a lot going for it: 1) it's from Google, 2) it's easy to read and digest, 3) it makes a really clear case for monoliths.