
I went down a rabbit hole and found most of the missing lists on Common Crawl: https://mirandrom.github.io/bourdain-lists/

Unfortunately, AFAICT, the embedded image data were not included in the Common Crawl scrapes, and the few image URLs I tried don't seem to be indexed by Common Crawl. I only just started playing around with these tools, so I might've missed something.
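
For anyone who wants to repeat the check, here is a rough sketch of how a URL can be looked up in the Common Crawl URL index. It is not necessarily how the lookups above were done. The crawl ID is a placeholder to swap for whichever crawl you care about, the helper names are made up for illustration, and treating a 404 as "no captures" is my assumption about how the index server responds.

```python
import json
import urllib.error
import urllib.request
from urllib.parse import urlencode

INDEX_HOST = "https://index.commoncrawl.org"


def build_index_query(crawl_id, target_url):
    """Build a URL-index query for one crawl (crawl_id is e.g. 'CC-MAIN-2024-10')."""
    params = urlencode({"url": target_url, "output": "json"})
    return f"{INDEX_HOST}/{crawl_id}-index?{params}"


def parse_index_response(body):
    """The index answers with one JSON object per line; skip blank lines."""
    return [json.loads(line) for line in body.splitlines() if line.strip()]


def lookup(crawl_id, target_url):
    """Return the capture records for target_url, or [] if none are found."""
    query = build_index_query(crawl_id, target_url)
    try:
        with urllib.request.urlopen(query, timeout=30) as resp:
            return parse_index_response(resp.read().decode("utf-8"))
    except urllib.error.HTTPError as err:
        if err.code == 404:  # assumption: 404 means the URL has no captures
            return []
        raise


if __name__ == "__main__":
    # Placeholder crawl ID and URL; replace both with the ones you want to check.
    for record in lookup("CC-MAIN-2024-10", "example.com"):
        print(record.get("timestamp"), record.get("mime"), record.get("url"))
```

An empty result only means the URL is missing from that one crawl, so a thorough check would loop over several crawl IDs.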



Common Crawl is a text-only crawl.


I'm not so sure. They say "The crawled content is dominated by HTML pages and contains only a small percentage of other document formats." https://commoncrawl.github.io/cc-crawl-statistics/plots/mime...

In any case, all the images were external CloudFront URLs that have not been archived anywhere, AFAICT.


Hi. I'm the CTO at Common Crawl. Nice to meet you. There's a small amount of "bycatch", and you already discovered how to see it. Notice that it went down after I was hired.


Can't argue with those credentials. Thanks for confirming/clarifying!


