Hacker News | trebor's comments

Ignoring tariffs for the moment... No one should be shocked that Amazon would do this.

Remember back when it was controversial to apply sales tax to online shopping? The biggest lobbyist against it was Amazon, which marketed that fight as standing up for the market against the government. Then it got big enough to survive a reversal ... and swung the other way to weaponize the law against its smaller competitors.

They've done a lot more than this, including creating their own brands to inject into successful niche products and segments. They did all of that off the sales/product data they aggregated from every sale on their platform.

Amazon is a dirty player in the market, and everyone should remember that.


This is not really a related issue other than "tax". Amazon's policy on collecting sales taxes is just a distraction from the issue at hand.


> controversial to apply sales tax to online shopping

You know, they had a point though. California wanted a Washington corporation to collect taxes for it.

They could not fight it because they were in another state.

It struck me as basically "no taxation without representation".


But previously it also seemed like Amazon leadership was cosy with the administration.


Upvoted because we’re seeing the same behavior from all AI and SEO bots. They’re BARELY respecting robots.txt, and they’re hard to block. And when they crawl, they spam requests and drive up load so high that they crash many of our clients' servers.

If AI crawlers want access they can either behave, or pay. The consequence will be almost universal blocks otherwise!


A global tarpit is the solution. It makes sense anyway, even without taking AI crawlers into account. Back when I had to implement that, I went the semi-manual route: parse the access log, and any IP address averaging more than X hits a second on /api gets a -j TARPIT with iptables [1].
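Roughly the shape of it (a sketch, not the exact script: it assumes an nginx/Apache combined access log and the xtables-addons TARPIT target, and the path and threshold are placeholders you'd tune to your "X hits a second"):

    #!/usr/bin/env python3
    # Sketch only: count hits to /api per source IP in the access log and
    # tarpit the worst offenders. Assumes a combined log format and the
    # xtables-addons TARPIT target; LOG and THRESHOLD are placeholders.
    import re
    import subprocess
    from collections import Counter

    LOG = "/var/log/nginx/access.log"   # placeholder path
    THRESHOLD = 600                     # placeholder: tune to your traffic

    hits = Counter()
    api_req = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "\w+ /api')

    with open(LOG) as fh:
        for line in fh:
            m = api_req.match(line)
            if m:
                hits[m.group(1)] += 1

    for ip, count in hits.items():
        if count > THRESHOLD:
            # TARPIT accepts the connection and then ignores it, so the
            # crawler wastes its own sockets instead of just retrying.
            subprocess.run(
                ["iptables", "-A", "INPUT", "-s", ip, "-p", "tcp", "-j", "TARPIT"],
                check=False,
            )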

Not sure how to implement it in the cloud though, never had the need for that there yet.

[1] https://gist.github.com/flaviovs/103a0dbf62c67ff371ff75fc62f...


One such tarpit (Nepenthes) was just recently mentioned on Hacker News: https://news.ycombinator.com/item?id=42725147

Their site is down at the moment, but luckily they haven't stopped Wayback Machine from crawling it: https://web.archive.org/web/20250117030633/https://zadzmo.or...


Quixotic[0] (my content obfuscator) includes a tarpit component, but for something like this, I think the main quixotic tool would be better - you run it against your content once, and it generates a pre-obfuscated version of it. It takes a lot less of your resources to serve than dynamically generating the tarpit links and content.

0 - https://marcusb.org/hacks/quixotic.html


How do you know their site is down? You probably just hit their tarpit. :)


I would think public outcry by influencers on social media (such as this thread) is a better deterrent, and it also establishes a public data point and exhibit for future reference ... as it is hard to scale the tarpit.


This doesn't work with the kind of highly distributed crawling that is the problem now.


Don't we have intellectual property law for this tho?


> The consequence will be almost universal blocks otherwise!

How? The difficulty of doing that is the problem, isn't it? (Otherwise we'd just be doing that already.)


> (Otherwise we'd just be doing that already.)

Not quite what the original commenter meant but: WE ARE.

A major consequence of this reckless AI scraping is that it turbocharged the move away from the web and into closed ecosystems like Discord. Away from the prying eyes of most AI scrapers ... and the search engine indexes that made the internet so useful as an information resource.

Lots of old websites & forums are going offline as their hosts either cannot cope with the load or send a sizeable bill to the webmaster who then pulls the plug.


What do you mean by "barely" respecting robots.txt? Wouldn't that be more binary? Are they respecting some directives and ignoring others?


I believe that a number of AI bots only respect robots.txt entries that explicitly name their static user agent. They ignore wildcard user agents.

That counts as barely imho.

I found this out after OpenAI was decimating my site and ignoring the wildcard deny-all. I had to add entries specifically for their three bots to get them to stop.
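For anyone tuning their own robots.txt, the difference is roughly the one sketched below. The wildcard block is what got ignored in my case; the named entries are examples of the pattern, so confirm the current user agent strings against each vendor's documentation.

    # Wildcard deny-all -- the part several AI crawlers ignore:
    User-agent: *
    Disallow: /

    # Explicit per-bot entries -- what finally got them to stop.
    # (Names are examples; check each vendor's docs.)
    User-agent: GPTBot
    Disallow: /

    User-agent: ChatGPT-User
    Disallow: /

    User-agent: OAI-SearchBot
    Disallow: /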


Even some non-profits ignore it now; the Internet Archive stopped respecting it years ago: https://blog.archive.org/2017/04/17/robots-txt-meant-for-sea...


IA actually has technical and moral reasons to ignore robots.txt. Namely, they want to circumvent this stuff because their goal is to archive EVERYTHING.


Isn’t this a weak argument? OpenAI could also say their goal is to learn everything, feed it to AI, advance humanity etc etc.


OAI is using others' work to resell it in models. IA uses it to preserve the history of the web.

There is a case to be made about the value of the traffic you'll get from OAI search, though...


It does depend a lot on how you feel about IA's integrity :P


I also don't think they hit servers repeatedly all that much.


As I recall, this is outdated information. The Internet Archive does respect robots.txt and will remove a site from its archive based on it. I did this a few years after your linked blog post to get an inconsequential site removed from archive.org.


The most recent notice IA have blogged was in 2017, and there's no indication that the service has reversed course on robots.txt since.

<https://blog.archive.org/?s=robots.txt>


This is highly annoying and rude. Is there a complete list of all known bots and crawlers?



Amazonbot doesn't respect the `Crawl-Delay` directive. To be fair, Crawl-Delay is non-standard, but it is claimed to be respected by the other 3 most aggressive crawlers I see.

And how often does it check robots.txt? ClaudeBot will make hundreds of thousands of requests before it re-checks robots.txt to see that you asked it to please stop DDoSing you.
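For reference, the directive in question is just a line like this in robots.txt (purely advisory, and apparently ignored here):

    User-agent: Amazonbot
    Crawl-delay: 10    # seconds between requests; non-standard, honored only voluntarily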


One would think they'd at least respect the cache-control directives. Those have been in the web standards since forever.
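Even a simple conditional re-fetch would cut most of the load; something like this illustrative exchange (not from any particular bot):

    GET /some/page HTTP/1.1
    Host: example.org
    If-None-Match: "v42"

    HTTP/1.1 304 Not Modified
    ETag: "v42"
    Cache-Control: max-age=86400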


Here's Google, complaining of problems with pages they want to index but I blocked with robots.txt.

    New reason preventing your pages from being indexed

    Search Console has identified that some pages on your site are not being indexed 
    due to the following new reason:

        Indexed, though blocked by robots.txt

    If this reason is not intentional, we recommend that you fix it in order to get
    affected pages indexed and appearing on Google.
    Open indexing report
    Message type: [WNC-20237597]


They are not complaining. You configured Google Search Console to notify you about problems that affect the search ranking of your site, and that's what they do. If you don't want to receive these messages, turn them off in Google Search Console.


Is there some way websites can sell that data to AI bots as one large zip file rather than being constantly DDoSed?

Or they could at least have the courtesy to scrape during nighttime / off-peak hours.


No, because they won't pay for anything they can get for free. There's only one situation where an AI company will pay for data, and that's when it's owned by someone with scary enough lawyers to pressure them into paying up. Hence why OpenAI has struck licensing deals with a handful of companies while continuing to bulk-scrape unlicensed data from everyone else.


There is a project whose goal is to avoid this crawling-induced DDoS by maintaining a single shared web index: https://commoncrawl.org/


Is existing intellectual property law not sufficient? Why aren't companies being prosecuted for large-scale theft?


> The consequence will be almost universal blocks otherwise!

Who cares? They've already scraped the content by then.


Bold to assume that an AI scraper won't come back to download everything again, just in case there's any new scraps of data to extract. OP mentioned in the other thread that this bot had pulled 3TB so far, and I doubt their git server actually has 3TB of unique data, so the bot is probably pulling the same data over and over again.


FWIW that includes other scrapers, Amazon's is just the one that showed up the most in the logs.


If they only needed a one-time scrape, we really wouldn't be seeing noticeable bot traffic today.


That's the spirit!


If they're AI bots it might be fun to feed them nonsense. Just send back megabytes of "Bezos is a bozo" or something like that. Even more fun if you could cooperate with many other otherwise-unrelated websites, e.g. via time settings in a modified tarpit.


Don't worry, though, because IP law only applies to peons like you and me. :)


Okay, but why not work on making use of atmospheric methane more practical? Ton for ton, CO2 is less of a warming influence than methane, and there have been huge natural gas (methane) leaks in the last 10-20 years. Even MIT admits that methane is more potent: https://climate.mit.edu/ask-mit/what-makes-methane-more-pote...


So this is just an advertisement for your company's services? Noted...


Upvoted because your ToS is very clear that the images we upscale aren't used to improve training. Thank you for that.

I have some pretty old photos I may have to pull out and upscale. They're from back in the old 1MP camera days.


So 5,000 qubits to crack a 50-bit key. That’s an interesting factor. Assuming it scales linearly, 204,800 qubits would crack RSA-2048 keys. I’m curious why it scales to needing “millions” according to those researchers.


Fruit trees can take a very long time to mature and grow flowers/fruit. Hopefully it can pollinate with itself, and isn’t a kind that has sexes.


Worst case you can clone it and GMO it into a different sex. Wouldn't be easy or cheap, but it's definitely something that's possible.


I still believe Recall to be a very bad idea, a feature no one wanted, and a risk even as an opt-in choice. But at least it will be harder to access.

I want to see a security researcher play with it though.

So far, I'm not a fan of Windows 11, Windows+AI, etc.


I have used and developed in WordPress since 3.2. Mullenweg is a dictator and a maverick, and I’m not convinced that he’s good for the WordPress ecosystem.

But neither are highly customized WP hosting platforms.

Revisioning, especially since the post_meta table was added, is a huge burden on the DB. I’ve seen clients rack up 50 revisions per post, totaling thousands of revisions and 200k post_meta entries. Is that important enough to call disabling it by default a “cancer”? Chill out, Matt.

Revisions aren’t relevant past revision 3-5.
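For what it's worth, there's a middle ground between unlimited revisions and calling the default a "cancer": capping them in wp-config.php, something like the line below (the exact number is a judgment call).

    /* true (the default) keeps every revision, false keeps none beyond the
       autosave, and an integer caps how many are stored per post. */
    define( 'WP_POST_REVISIONS', 5 );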


What database is burdened by 200k rows? That’s tiny.


It’s the excess, unaccessed content. The indexes haven’t been well optimized in MySQL (MariaDB is better).

But still. A lot of small companies only pay $20/mo for hosting …


But a database can handle tens of millions of rows with those resources.

If you’re worried about excess, why even use Wordpress? My god - serving rarely updated static content with a database? Stupid. The entire thing is excessive and wasteful.


Maybe you misunderstand the market my employer serves?

We've built sites for clients both huge and small. Our clients like WordPress because it's well supported, easy to roll out, and easy to find someone to work on. Lots of people have experience with it, from having their own blog.

Even infrequently updated content can go through a logjam of revisions. And this is the failure of WordPress's versioning model: there's no way to “check in” a revision, so you can't mark what's approved/reviewed. Instead of a “check in” that nukes the middle revisions ... now you have 5-7 revisions where someone updated a button's text.

Add in ACF fields (which use approximately 2.5 rows in post_meta PER FIELD) and now you've got a complex page built up from lots of rows. Now each “revision” is 1 post row + 20 to even 1,000 post_meta rows. You see the problem.

Over time the DB bloats, and the index isn't partitioned in any way. That page a cheap dev built to query something? Runs 5x slower after a year, and the client doesn't know why.
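If you want to see how much of that bloat is revision baggage on a given install, a read-only check along these lines does it (assuming the default wp_ table prefix):

    -- postmeta rows grouped by the type of post they hang off of;
    -- a huge 'revision' bucket is the smoking gun.
    SELECT p.post_type, COUNT(pm.meta_id) AS meta_rows
    FROM wp_posts p
    LEFT JOIN wp_postmeta pm ON pm.post_id = p.ID
    GROUP BY p.post_type
    ORDER BY meta_rows DESC;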

The only reasons we use WP over other platforms are support, maintenance, and, most importantly to the client, COST.


Wait, so they added 30g of erythritol and observed negative impacts? Most of the time I see it at around 7g, plus other sweeteners. It’s almost never alone now.

I bet they need to look harder at 30-45g of sugar now. Maybe uncover some of that (big-sugar-suppressed) research showing that it causes heart disease.

