> Headless browser scraping is between 10x and 100x more resource intensive, even if you carefully block requests and cache resources.

Instead of setting up some kind of partnership with our vendors, where they just send us information or provide an API, we scrape their websites.

The old version ran in an hour, using one thread on one machine. It downloaded PDFs and extracted the values.

The new version is Selenium-based, uses 20 cores and 300GB of memory, and takes all night to run. It does the same thing, but from inside a browser.

As a bonus, the 'web scrapers' were blamed for every performance issue we had for a long time.


