
“In a library on a burner laptop” - but I’ll narrow it down to people who have handed in their notice on a specific day.

A useful way to frame this isn’t “is it worth tens of hours to avoid a future reinstall” but “where do I want my entropy to live”. You’re going to invest time somewhere: either in a slowly-accumulating pile of invisible state (brew, manual configs, random installers) or in a config that you can diff, review and roll back. The former feels free until you hit some cursed PATH/SSL/toolchain issue at 11pm and realize you’ve been paying that tax all along, just in tiny, forgotten increments.

I think where Nix shines isn’t “one laptop every 6 years” but when your environment needs to be shared or recreated: multiple machines, a team, or a project with nasty native deps. At that point, nix-darwin + dev shells becomes infrastructure, not a hobby. You don’t have to go all-in on “my whole Mac is Nix” either: keep GUI apps and casual tools imperative, and treat Nix as the source of truth for the stuff that actually blocks you from doing work. That hybrid model matches what the article hints at and tends to give you most of the upside without turning your personal laptop into a second job.


One of the biggest annoyances I have with doing this with Nix vs another tool is that Nix doesn't natively communicate back state changes so that you can make them reproducible.

If I make a git repo, place '~/.config/newsapp.conf' in there and then symlink it back into '~/.config/', and NewsApp later introduces a new variable in its settings, I'm immediately aware because Git will complain about the repo being dirty. Nix, however, will happily nuke that .conf and rebuild whatever is in your configuration without ever telling you that the application had written new state, which is ultimately bad for reproducibility. It's a huge blind spot in Nix.
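For what it's worth, the symlink side of that workflow is tiny; a rough Python sketch, assuming a dotfiles repo at ~/dotfiles/config (stow or a shell loop would do the same job):

    from pathlib import Path

    # Hypothetical layout: files live in ~/dotfiles/config and get linked into
    # ~/.config. Once linked, anything an app writes through the symlink lands
    # in the repo, so `git status` flags new or changed state immediately.
    REPO = Path.home() / "dotfiles" / "config"
    TARGET = Path.home() / ".config"

    for src in (p for p in REPO.rglob("*") if p.is_file()):
        dest = TARGET / src.relative_to(REPO)
        dest.parent.mkdir(parents=True, exist_ok=True)
        if dest.is_symlink() or dest.exists():
            dest.unlink()
        dest.symlink_to(src)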


>I think where Nix shines isn’t “one laptop every 6 years” but when your environment needs to be shared or recreated: multiple machines, a team, or a project with nasty native deps.

I'd like to add a third thing, which is just iteration. It's very tricky to maintain advanced workflows even locally. I'd guess many people won't even try to compose things that could work well in combination (often self-hosted services) when they know they can't reliably maintain those environments.


What I like about this writeup is that it quietly demolishes the idea that you need DeepMind-scale resources to get “superhuman” RL. The headline result is less about 2048 and Tetris and more about treating the data pipeline as the main product: careful observation design, reward shaping, and then a curriculum that drops the agent straight into high-value endgame states so that it sees them at all. Once your env runs at millions of steps per second on a single 4090, the bottleneck is human iteration on those choices, not FLOPs.
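The curriculum part is less exotic than it sounds; here's a hedged sketch of the reset-into-endgame-states idea (the function names and env interface are invented, not from the post):

    import random

    # Instead of always resetting to the initial board, sometimes drop the agent
    # straight into a saved late-game state so it actually encounters those states.
    # reset_fn / reset_to_fn / endgame_states stand in for whatever the real env exposes.
    def make_curriculum_reset(reset_fn, reset_to_fn, endgame_states, p_endgame=0.5):
        def reset():
            if endgame_states and random.random() < p_endgame:
                return reset_to_fn(random.choice(endgame_states))
            return reset_fn()
        return reset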

The happy Tetris bug is also a neat example of how “bad” inputs can act like curriculum or data augmentation. Corrupted observations forced the policy to be robust to chaos early, which then paid off when the game actually got hard. That feels very similar to tricks in other domains where we deliberately randomize or mask parts of the input. It makes me wonder how many surprisingly strong RL systems in the wild are really powered by accidental curricula that nobody has fully noticed or formalized yet.
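And if you wanted that corruption on purpose rather than by accident, it's only a few lines of augmentation; a toy sketch, not what the author actually did:

    import numpy as np

    def corrupt_observation(obs, mask_prob=0.1, rng=None):
        # Randomly zero out a fraction of the observation, mimicking the "happy bug":
        # the policy has to cope with missing/garbage inputs from the very start.
        rng = np.random.default_rng() if rng is None else rng
        noisy = np.array(obs, copy=True)
        noisy[rng.random(noisy.shape) < mask_prob] = 0
        return noisy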


You never needed DeepMind-scale resources to get superhuman performance on a small subset of narrow tasks. Deep Blue-scale resources are often enough.

The interesting tasks, however, tend to take a lot more effort.


Nice tool. Some of the text boxes, such as sitemap text, are not legible in dark mode - the text is light grey on a white background.

What I like about this approach is that it quietly reframes the problem from “detect AI” to “make abusive access patterns uneconomical”. A simple JS+cookie gate is basically saying: if you want to hammer my instance, you now have to spin up a headless browser and execute JS at scale. That’s cheap for humans, expensive for generic crawlers that are tuned for raw HTTP throughput.

The deeper issue is that git forges are pathological for naive crawlers: every commit/file combo is a unique URL, so one medium repo explodes into Wikipedia-scale surface area if you just follow links blindly. A more robust pattern for small instances is to explicitly rate limit the expensive paths (/raw, per-commit views, “download as zip”), and treat “AI” as an implementation detail. Good bots that behave like polite users will still work; the ones that try to BFS your entire history at line rate hit a wall long before they can take your box down.
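To make the "rate limit the expensive paths" part concrete, the core really is small; a Python sketch where the path prefixes and limits are made up for illustration:

    import time
    from collections import defaultdict, deque

    # Only throttle the paths that actually hurt: raw blobs, per-commit views, archives.
    EXPENSIVE_PREFIXES = ("/raw/", "/commit/", "/archive/")
    WINDOW_SECONDS = 60
    MAX_EXPENSIVE_HITS = 30

    hits = defaultdict(deque)  # ip -> timestamps of recent expensive requests

    def allow(ip, path, now=None):
        if not path.startswith(EXPENSIVE_PREFIXES):
            return True  # cheap pages are never throttled
        now = time.time() if now is None else now
        q = hits[ip]
        while q and now - q[0] > WINDOW_SECONDS:
            q.popleft()
        if len(q) >= MAX_EXPENSIVE_HITS:
            return False  # have the web layer return 429 or a challenge here
        q.append(now)
        return True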


Yeah, this is where I landed a while ago. What problem am I _really_ trying to solve?

For some people it's an ideological one--we don't want AI vacuuming up all of our content. For those, "is this an AI user?" is a useful question to answer. However it's a hard one.

For many the problem is simply "there is a class of users that are putting way too much load on the system and it's causing problems". Initially I was playing whack-a-mole with this and dealing with alerts firing on a regular basis because of Meta crawling our site very aggressively, not backing off when errors were returned, etc.

I looked at rate limiting but the work involved in distributed rate limiting versus the number of offenders involved made the effort look a little silly, so I moved towards a "nuke it from orbit" strategy:

Requests are bucketed by class C subnet (31.13.80.36 -> 31.13.80.x) and the request rate is tracked over 30-minute windows. If the request rate over that window exceeds a very generous threshold (I've only seen a few very obvious and poorly behaved crawlers exceed it), it fires an alert.

The alert kicks off a flow where we look up the ASN covering every IP in that range, look up every range associated with those ASNs, and throw an alert in Slack with a big red "Block" button attached. When approved, the entire ASN is blocked at the edge.

It's never triggered on anything we weren't willing to block (e.g., a local consumer ISP). We've dropped a handful of foreign providers, some "budget" VPS providers, some more reputable cloud providers, and Facebook. It didn't take long before the alerts stopped, both the high-request-rate ones and the ones from our application monitoring seeing excessive load.

If anyone's interested in trying to implement something similar, there's a regularly updated database of ASN <-> IP ranges announced here: https://github.com/ipverse/asn-ip
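And here's roughly what the bucketing/threshold side looks like; a rough Python sketch with the ASN lookup and Slack button left as a stub (the threshold is just a placeholder, pick your own "very generous" number):

    import time
    from collections import defaultdict, deque

    WINDOW = 30 * 60       # 30-minute window
    THRESHOLD = 50_000     # placeholder request count

    buckets = defaultdict(deque)  # "31.13.80.x" -> request timestamps

    def fire_alert(bucket, count):
        # stub: look up the ASN(s) covering the range and post the Block button to Slack
        print(f"ALERT: {bucket} made {count} requests in the last 30 minutes")

    def record_request(ip, now=None):
        now = time.time() if now is None else now
        bucket = ".".join(ip.split(".")[:3]) + ".x"  # /24 bucket, e.g. 31.13.80.x
        q = buckets[bucket]
        q.append(now)
        while q and now - q[0] > WINDOW:
            q.popleft()
        if len(q) > THRESHOLD:
            fire_alert(bucket, len(q))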


> If anyone's interested in trying to implement something similar, there's a regularly updated database of ASN <-> IP ranges announced here: https://github.com/ipverse/asn-ip

What exactly is the source of these mappings? Never heard about ipverse before, seems to be a semi-anonymous GitHub organization and their website has had a failing certificate for more than a year by now.


whois (delegation files) according to the embedded blog post, eg https://ftp.arin.net/pub/stats/arin/delegated-arin-extended-...

You ban the ASN permanently in this scenario?

So far, yes.

I could justify it a number of ways, but the honest answer is "expiring these is more work that just hasn't been needed yet". We hit a handful of bad actors, banned them, have heard no negative outcomes, and there's really little indication of the behaviour changing. Unless something shows up and changes the equation, right now it looks like "extra effort to invite the bad actors back to do bad things" and... my day is already busy enough.


I don't know. Use PAT. The long-term solution is Web Environment Integrity by another name.

And by a company which isn't knee deep in this itself.

It depends what your goal is.

Having to use a browser to crawl your site will slow down naive crawlers at scale.

But it wouldn't do much against individuals typing "what is a kumquat" into their local LLM tool that issues 20 requests to answer the question. They're not really going to care nor notice if the tool had to use a playwright instance instead of curl.

Yet according to Cloudflare, it's that use-case that is responsible for ~all of my AI bot traffic, which is 30x the traffic of direct human users. In my case, being a forum, it made more sense to just block the traffic.


Maybe a stupid question but how can Cloudflare detect what portion of traffic is coming from LLM agents? Do agents identify themselves when they make requests? Are you just assuming that all playwright traffic originated from an agent?

That is what Cloudflare's bot metrics dashboard told me before I enabled their "Super Bot Fight Mode", which brought traffic back down to its pre-bot levels.

I assume most traffic comes from hosted LLM chats (e.g. chatgpt.com) where the provider (e.g. OpenAI) is making the requests from their own servers.


I'm curious about whether there are well-coded AI scrapers that have logic for “aha, this is a git forge, git clone it instead of scraping, and git fetch on a rescrape”. Why are there apparently so many naive (but still coded to be massively parallel and botnet-like, which is not naive in that respect) crawlers out there?

If they're handling it as “website, don't care” (because they're training on everything online) they won't know.

If they're treating it specifically as a “code forge” (because they're after coding use cases), there's lots of interesting information that you won't get by just cloning a repo.

It's not just the current state of the repo, or all the commits (and their messages). It's the initial issue (and discussion) that leads to a pull request (and review comments) that eventually gets squashed into a single commit.

The way you code with an agent is a lot more similar to the issue, comments, change, review, refinement sequence that you get by slurping the website.


I'm not an industry insider and not the source of this fact, but it's been previously stated that the traffic cost to fetch the current data for each training run is cheaper than caching it in any way locally - whether it's a git repo, static sites or any other content available over HTTP.

This seems nuts and suggests maybe the people selling AI scrapers their bandwidth could get away with charging rather more than they do :)

I'd see this as coming down to incentives. If you can scrape naively and it's cheap, what's the benefit to you in doing something more efficient for a git forge? How many other edge cases are there where you could potentially save a little compute/bandwidth but would need to implement a whole other set of logic?

Unfortunately, this kind of scraping seems to inconvenience the host way more than the scraper.

Another tangent: there probably are better behaved scrapers, we just don't notice them as much.


True, and it doesn't get mentioned enough. These supposedly world-changing advanced tech companies sure look sloppy as hell from here. There is no need for any of this scraping.

I guess they're vibe coded :D

what's next: you can only read my content after mining btc and wiring it to $wallet->address

Postgres’s extensible index AM story doesn’t get enough love, so it’s nice to see someone really lean into it for LIKE. Biscuit is basically saying: “what if we precompute an aggressive amount of bitmap structure (forward/backward char positions, case-insensitive variants, length buckets) so most wildcard patterns become a handful of bitmap ops instead of a heap scan or bitmap heap recheck?” That’s a very different design point from pg_trgm, which optimizes more for fuzzy-ish matching and general text search than for “I run a ton of LIKE '%foo%bar%' on the same columns”.
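To make the bitmap idea concrete, here's a toy Python version of the flavour of the computation (nothing to do with Biscuit's actual on-disk format): keep a per-character bitmap of which rows contain it, AND the bitmaps for the pattern's literal parts, and only recheck the survivors.

    from collections import defaultdict

    def build_index(rows):
        bitmaps = defaultdict(int)       # char -> bitmap of rows containing it
        for i, text in enumerate(rows):
            for ch in set(text):
                bitmaps[ch] |= 1 << i
        return bitmaps

    def candidates(bitmaps, nrows, parts):   # parts = ["foo", "bar"] for %foo%bar%
        acc = (1 << nrows) - 1               # start from "all rows"
        for part in parts:
            for ch in set(part):
                acc &= bitmaps.get(ch, 0)
        return [i for i in range(nrows) if acc >> i & 1]

    rows = ["foobar", "barfoo", "bazfood", "hello"]
    idx = build_index(rows)
    # survivors still need the exact, ordered LIKE recheck:
    print([rows[i] for i in candidates(idx, len(rows), ["foo", "bar"])])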

The interesting question in prod is always the other side of that trade: write amplification and index bloat. The docs are pretty up-front that write performance and concurrency haven’t been deeply characterized yet, and they even have a section on when you should stick with pg_trgm or plain B-trees instead. If they can show that Biscuit stays sane under a steady stream of updates on moderately long text fields, it’ll be a really compelling option for the common “poor man’s search” use case where you don’t want to drag in an external search engine but ILIKE '%foo%' is killing your box.


Wouldn't tsvector, tsquery, ts_rank, etc. be Postgres's "poor man's search" solution? With language-aware stemming they don't need to be as aggressive with writing to indexes as you describe Biscuit above.

But if you really need to optimize LIKE instead of providing plain text search, sure.


Investors talk about HN like it’s a growth lever, but the site mostly behaves like a long-running reading habit with spam defenses. A tiny slice of users on /new decide whether you even get a shot, gravity slowly pushes old stuff down, there’s a “second chance” queue for posts that looked promising but died early, and moderators occasionally hand-tune obvious mistakes. Beyond that, it’s just a bunch of curious people clicking what looks interesting.

The only repeatable “strategy” I’ve seen work is: write things that would be interesting even if HN didn’t exist, and let other people submit them. Trying to treat HN as a distribution channel (carefully timed posts, optimized titles, orchestrated upvotes) reliably backfires because the software + mods are explicitly optimized against that. If you treat it as a weird little newspaper run by nerds for their own curiosity, the dynamics suddenly make a lot more sense.


>If you treat it as a weird little newspaper run by nerds for their own curiosity

That's my favorite phrase of this entire thread, which I'm reading in its entirety.


I buy the economics argument, but I’m not sure “mainstream formal verification” looks like everyone suddenly using Lean or Isabelle. The more likely path is that AI smuggles formal-ish checks into workflows people already accept: property checks in CI, model checking around critical state machines, “prove this invariant about this module” buttons in IDEs, etc. The tools can be backed by proof engines without most engineers ever seeing a proof script.

The hard part isn’t getting an LLM to grind out proofs, it’s getting organizations to invest in specs and models at all. Right now we barely write good invariants in comments. If AI makes it cheap to iteratively propose and refine specs (“here’s what I think this service guarantees; what did I miss?”) that’s the moment things tip: verification stops being an academic side-quest and becomes another refactoring tool you reach for when changing code, like tests or linters, instead of a separate capital-P “formal methods project”.
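The "property checks in CI" step already exists in a lightweight form today. With Hypothesis, for example, you state an invariant and let the machine hunt for counterexamples, with no proof script in sight (a toy example, not tied to any particular service):

    from hypothesis import given, strategies as st

    def dedupe_keep_order(xs):
        seen, out = set(), []
        for x in xs:
            if x not in seen:
                seen.add(x)
                out.append(x)
        return out

    @given(st.lists(st.integers()))
    def test_dedupe_invariants(xs):
        out = dedupe_keep_order(xs)
        assert len(out) == len(set(xs))    # no duplicates survive
        assert all(x in xs for x in out)   # nothing invented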


We’ve had variations of “JSON describes the screen, clients render it” for years; the hard parts weren’t the wire format, they were versioning components, debugging state when something breaks on a specific client, and not painting yourself into a corner with a too-clever layout DSL.

The genuinely interesting bit here is the security boundary: agents can only speak in terms of a vetted component catalog, and the client owns execution. If you get that right, you can swap the agent for a rules engine or a human operator and keep the same protocol. My guess is the spec that wins won’t be the one with the coolest demos, but the one boring enough that a product team can live with it for 5-10 years.
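The vetted-catalog part is really just schema validation at the trust boundary: the agent proposes UI as plain data, and the client refuses anything outside a fixed catalog before rendering. A rough sketch (component names and payload shape invented for illustration):

    CATALOG = {
        "text":   {"content"},
        "button": {"label", "action_id"},  # action_id maps to client-owned handlers
        "list":   {"items"},
    }

    def validate(node):
        kind, props = node.get("type"), node.get("props", {})
        if kind not in CATALOG:
            raise ValueError(f"unknown component: {kind!r}")
        extra = set(props) - CATALOG[kind]
        if extra:
            raise ValueError(f"unexpected props on {kind}: {extra}")
        for child in node.get("children", []):
            validate(child)

    validate({"type": "list", "props": {"items": []}, "children": [
        {"type": "button", "props": {"label": "Retry", "action_id": "retry"}}]})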


A decade of “personal cloud box” attempts has shown that the hard part isn’t the hardware, it’s the long-term social contract. Synology/WD/My Cloud/etc all eventually hit the same wall: once the company pivots or dies, you’re left with a sealed brick that you don’t fully control, holding the most irreplaceable thing you own: your data. If you’re going to charge an Apple-like premium on commodity mini-PC hardware, you really have to over-communicate what happens if Umbrel-the-company disappears or changes direction: how do I keep using this thing in 5–10 years without your cloud, your app store, your updates?

The interesting opportunity here isn’t selling a fancy N100 box, it’s turning “self-hosted everything” into something your non-technical friend could actually live with. That’s mostly about boring stuff: automatic off-site backup that isn’t tied to one vendor, painless replacement/restore if the hardware dies, and clear guarantees about what runs locally vs phoning home. If Umbrel leans into being forkable and portable across generic hardware, it has a shot at being trusted infrastructure instead of just another pretty NAS that people regret once the marketing site goes dark.


Don't forget the user experience needs to be seamless. We're in a bubble as tech-fluent folks on HN, but the seamless quality needs to be on par with or better than Google Drive, iCloud Drive, Google/iCloud Photos, etc.

Ability to share, good default security, and seamless integration with the things people care about.

If this device can't automatically back up a phone wirelessly and without my interaction, it will be a poor proposition for most people.

We would all have been better off fiercely advocating for open protocols for all this stuff first (forced interop), but technologists have not wanted to wade into that in a sustained, en masse way


There is no way to properly make money from fully open protocols. If you do the hard work of research and development, your competitors can just take the work and sell their implementation minus the R&D costs, undercutting you. It's not sustainable.

It's basically what Apple learned during the Macintosh clone era. Churning out countless units of the same stuff isn't that complicated once you have figured out what needs to be copied. Getting to the worth-copying state is the hard and expensive part; nobody is going to do it for free.

This can readily be seen in the "free" open-source software world. The vast majority of it is just lower-quality copies of existing software.


I've tried a lot of personal cloud options (ownCloud, a Resilio Sync mesh, CloudMounter + B2) and somehow ended up back on iCloud because of this.

My next experiment is just to use NFS over Nebula/Tailscale and see how much data I can just host off my NAS, but it's surprisingly been quite a journey for a simple problem.


You can't really switch away from iCloud without sacrificing its deep integration.

The whole ecosystem is designed around it.

Don't get me wrong, Apple could've written their software with different upstream options, but they chose not to - hence moving away from iCloud forces you to give up a lot of features.

I'm just pointing this out because if you've already attempted different options and went back to iCloud, then trying others isn't likely to be worthwhile, honestly.

You'd first have to accept that moving away from it means sacrificing features such as the photos sync (including delete etc).


That for me has been Dropbox. It's not even a shadow of what it used to be as a sleek, perfect sync tool, but the competition is so bad and getting worse every day (along with Dropbox) that "Dropbox + Cryptomator" is literally the best option I still have. Tresorit seemed to come close, but it's bug-ridden and really sluggish, and their support is painfully useless.

And as someone who has been in Apple's hardware ecosystem for more than a decade now (almost exclusively), I can't in my right mind bring myself to use any of its software/service products (and for good reasons, seeing it go bad to very bad to downright pathetic over the years) except for the OS because that's not really an option. Yes, I do have a small Cryptomator folder syncing to iCloud as well, but that's just because I wanted to have that as a backup sync, and it's a very tiny set of data that I anyway backup to elsewhere.

The bad of it? Yes, I give up the convenience of keeping everything under one roof, which really does feel simple and easy.

The good? If Apple blocks my account today or nukes it, it will take a few hours to a few days, but I will get back every single piece of data I have online, on a new laptop or phone (Apple or Android or Windows or Linux) - everything! And it's a joy to use specific better/superior options for your software/service needs, per your own choice!


Even as a techie, I prefer and use iCloud for exactly this reason, especially for stuff I share with family. I don't want me to be the bottleneck for what is considered basic functionality these days.


That's exactly my goal with HomeFree:

https://homefree.host

Goal is my mom running it, and keeping it 100% open source.

It looks like there isn't a lot of visible progress, but there's now a branch with a live CD installer, and an admin UI, so no command line shenanigans are necessary. Once that is cleaned up, the website will be refreshed.

I really need to quit my job so I can work on this full time.


What will make development sustainable? I mean, it could take some time until it gets traction, and usually open source works if there is a supporting company behind it.


I am going to get it to a point where moderately technical people would be happy to use it over other options, and build a community that contributes. I will continue to work on making it easier to use over time.


HomeFree must be deployed from another machine with Nix installed.

Your mom runs Nix?


No, if you use the installer CD it's a fully UI based install. No command line. No awareness it's even running nix.

Administration is through a web UI.

It could easily be pre-installed on a device like a NUC and delivered to my mom.

Did you read the FAQ?

I've got all this running on a branch but it's still rough. Once it's relatively stable it will be merged to master and the home page completely revamped.


No I didn’t read the FAQ, I read the get started section.

> Get a Nix environment set up on your host machine. HomeFree must be deployed from another machine with Nix installed. If you are on NixOS, you should be good to go. If you are on a different distribution of Linux, follow the instructions on how to install and use Nix.


Right, the published version is a super early technical preview. As mentioned in my top level comment the installer and admin UI are in a branch and I will update the website once these are released.


Hi, Umbrel CTO and cofounder here. Appreciate the thoughtful feedback.

> how do I keep using this thing in 5–10 years without your cloud, your app store, your updates?

The code is publicly available with a non-commercial restriction. If Umbrel the company disappears it's possible for a community maintained fork to live on. Someone else in this discussion mentioned that the NC clause hurts maintainability due to no future company being able to profit from taking over maintenance. They suggested we add a clause revoking the NC restriction if Umbrel goes out of business. It's a good suggestion and something we'll definitely consider, I think it should be possible.

Regarding apps specifically, we have the concept of "community app stores". Anyone can host their own app store, which is just a public git repo that any other user can use by pasting its URL into their web UI once. Community app stores completely bypass our main app store; they don't rely on our infrastructure and will continue working if we disappear. There are already hundreds of community app stores in use:

- https://github.com/getumbrel/umbrel-community-app-store/fork...
- https://github.com/search?q=in%3Areadme+sort%3Aupdated+-user...

> automatic off-site backup that isn’t tied to one vendor, painless replacement/restore if the hardware dies

We recently shipped backups baked directly into umbrelOS. You can back up to a local NAS, a USB device, or another Umbrel (local or remote). You can restore individual files from hourly/weekly/monthly snapshots, or restore the entire state of your Umbrel onto a fresh device from your backups.

https://x.com/umbrel/status/1970508327479320862

> portable across generic hardware

We currently support running on Raspberry Pi, all amd64 devices, virtual machines and there is unofficial support for running in Docker.

> The interesting opportunity here isn’t selling a fancy N100 box, it’s turning “self-hosted everything” into something your non-technical friend could actually live with.

I completely agree, that's the plan.


Sorry, isn't this running an open-source OS? The header has a link to a github with a non-commercial license[0].

If so, couldn't you just use the OS on non-premium-priced mini-PC hardware and never have to worry about them locking you out of your box? I guess maybe it's concerning if you're being forced to update by the OS? I've never actually run a system like that, but was considering umbrel OS (didn't actually know about the hardware until this post), so if I'm being naive about something, it's in earnest.

[0] https://github.com/getumbrel/umbrel


A non-commercial license prevents it from being open-source, and I think already constitutes extremely clear communication about what will happen to users when Umbrel goes bankrupt: they will be stranded, because the license doesn't allow another company to step up and take over maintenance the way an open-source license would.


These companies, if they are so afraid of an OSI-approved license, should put conditions into their licenses that trigger when they go out of business and release the IP.


This is a good suggestion, we're taking a look into it.


I’m not worried about “can I, personally, keep this thing running?” so much as “what is the long-term story for the kind of person who buys a turnkey appliance”.

Yes, Umbrel OS is on GitHub and you can already run it on generic NUCs / Pi etc. That’s great. But the value prop of the hardware is the whole bundle: curated apps, painless updates, maybe remote access, maybe backups. If Umbrel-the-company pivots or withers, the repo still being there under a non-commercial license doesn’t guarantee ongoing maintenance, an app store, or support. And the NC clause is exactly what makes it hard for someone else to step in and sell a fully supported forked “Umbrel but maintained” box to non-technical users. So for people like you and me, sure, we can just install it elsewhere; for the target audience of an expensive plug-and-play box, the long-term social contract is still the fragile part.


Ah, okay, yeah, I get you now. I could get behind a splashy section about how users can "walk away at any time" with a roadmap that seems reasonable. I think that fits in with the general ethos of what these things should offer to consumers. I can certainly see why a company wouldn't be keen to advertise "if we die, here's what you can do.", but a way to tell consumers how to gracefully exit doesn't seem so antithetical to a marketing plan, and personally, knowing they've given me an off-ramp does make me more likely to use a thing.


I run Umbrel in a VM, for non-fiat finops stuff.

I also run Cloudron on a VPS.

I wish both of those solutions had more mindshare. They save me so much time and effort. Especially Cloudron!


I looked at Cloudron and I'm not sure why I would choose it over just throwing Proxmox on a box and clicking stuff in their 'app store'.


Proxmox = infra. You run ops. Cloudron = platform. Ops is mostly done. Clicking apps is easy. Maintaining them isn’t.


Right, but with helper-scripts "ops is mostly done" on Proxmox too. You just point at them and perhaps follow some instructions and that's it.


Helper scripts automate day 0. Cloudron automates day 2+.

Install ≠ operate


> you’re left with a sealed brick that you don’t fully control

Totally agreed. I had seen umbrel and others in the past but recently decided to just get a 4-bay m.2 ssd enclosure (using RAID 1 for 2 sets of 2), not a NAS (after previously having a Synology NAS). I only want pure file access in a small, quiet form factor and I can have another Mac host and cloud backup. Currently using Tailscale Drive (alpha feature) to share it with devices and working pretty well so far.

https://x.com/Stammy/status/2000355524429402472


I think the problem is ISP-provided routers being locked down. Alternatively, IPv6 availability and support. Alternatively, static residential IPv4 availability. Alternatively, dynamic DNS services, which always require a subscription to use your own domains.


This can be solved by adding an external NAS, for redundancy, plus an open-source application or extension that manages the syncing?

Making self-hosting more seamless is key; we simply can't trust being dependent on third parties for access to our own data in the long term.


If you already have a NAS, I'm not sure what this does for you that just getting a bigger NAS wouldn't?


Isn't Umbrel mostly open source and Docker-based?

