Nice list, but what would be the arguments for switching over from other libraries? I’ve built my own crawler over time, but from what I see, there’s nothing truly unique.


The main advantage (for now) is that the library has a single interface for both HTTP and headless browsers, plus bundled autoscaling. You can write your crawlers against the same base abstraction, and the framework takes care of the heavy lifting. Developers of scrapers shouldn't need to reinvent the wheel; they should be able to focus on building the "business" logic of their scrapers. Having said that, if you've written your own crawling library, the motivation to use Crawlee might be lower, and that's fair enough.
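To make the "single interface" point concrete, here's a minimal sketch using Crawlee's JS API (details simplified; treat it as approximate rather than copy-paste ready). Switching from plain HTTP to a headless browser is mostly a matter of swapping the crawler class - the request handler keeps the same shape:

    import { CheerioCrawler, PlaywrightCrawler, Dataset } from 'crawlee';

    // HTTP crawling: plain requests, HTML parsed with Cheerio.
    const httpCrawler = new CheerioCrawler({
        async requestHandler({ request, $, enqueueLinks }) {
            await Dataset.pushData({ url: request.url, title: $('title').text() });
            await enqueueLinks(); // follow links discovered on the page
        },
    });

    // Headless-browser crawling: same handler shape, but a live Playwright page.
    const browserCrawler = new PlaywrightCrawler({
        async requestHandler({ request, page, enqueueLinks }) {
            await Dataset.pushData({ url: request.url, title: await page.title() });
            await enqueueLinks();
        },
    });

    await httpCrawler.run(['https://example.com']);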

Please note that this is the first release, and we'll keep adding many more features as we go, including anti-blocking, adaptive crawling, etc. To see where this might go, check https://github.com/apify/crawlee


Can I ask - what is anti-blocking?


Usually refers to “evading bot detection”.

Detecting when blocked and switching proxy/“browser fingerprint”.
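In practice it's usually a retry loop: detect a response that looks like a block (403/429, a CAPTCHA interstitial, etc.) and retry through a different proxy, often with a fresh fingerprint. A generic sketch, not tied to any particular scraping library (the proxy URLs and the block heuristic here are made up for illustration):

    import { fetch, ProxyAgent } from 'undici';

    // Hypothetical proxy pool; real setups typically use rotating proxy services.
    const proxies = ['http://proxy-a:8000', 'http://proxy-b:8000'];

    // Crude block heuristic: hard-block status codes or a challenge page.
    function looksBlocked(status: number, body: string): boolean {
        return status === 403 || status === 429 || body.toLowerCase().includes('captcha');
    }

    // Try each proxy in turn until one response gets through.
    async function fetchWithRotation(url: string): Promise<string> {
        for (const proxy of proxies) {
            const res = await fetch(url, { dispatcher: new ProxyAgent(proxy) });
            const body = await res.text();
            if (!looksBlocked(res.status, body)) return body;
            // Blocked: fall through and retry via the next proxy.
        }
        throw new Error(`all proxies blocked for ${url}`);
    }

Fingerprint switching works the same way, except what you rotate is the browser's observable properties and TLS/HTTP signature rather than the IP address.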


Is this a good feature to include? Shouldn't we respect the host's settings on this?


It’s a fair and totally reasonable question, but it clashes with reality. Many hosts have data that others want to scrape (eBay, Amazon, Google, airlines, etc.), and they set up anti-scraping mechanisms to try to prevent it. Whether or not to respect those desires is a bigger question, but not one for the scraping library - it’s one for those doing the scraping and their lawyers.

The fact is, many, many people want to scrape these sites, and there is massive demand for tools to help them do that. So if Apify/Crawlee decide to take the moral high ground and not offer a way around bot detection, someone else will.


Ah yes, the old 'if I don't build the bombs for them, someone else will'. I don't think this is taking the moral high ground; this is saying: we don't care whether it's moral, there's demand, and we'll build it.


There are many legitimate and legal use cases where one might want to circumvent blocking of bots. We believe that everyone has the moral right to access and fairly use non-personal publicly available data on the web the way they want, not just the way the publishers want them to. This is the core founding principle of the open web, which allowed the web to become what it is today.

BTW we continuously update this exhaustive post covering all legal aspects of web scraping: https://blog.apify.com/is-web-scraping-legal/



It’s an “old” law that did not consider many intricacies of the internet and the platforms that exist on it, and it’s mostly made obsolete by EU case law, which has shrunk the definition of a protected database under this law so much that it’s practically inapplicable to web scraping.

(Not my opinion. I visited a major global law firm’s seminar on this topic a month ago and this is what they said.)


I'm not gonna feel bad if a corporation gets its data scraped (whenever it's legal to do so - that's another kind of question I'm not knowledgeable enough to tackle) when they themselves scrape other companies' data.


You seem to be making a massive category error here. To my understanding, this isn't only going to circumvent the scraping protections of companies that themselves scrape other people's data.


Google and Amazon were built on scraped data, who are you kidding?


There's a bidirectional benefit with Google, at least. That's why SEO exists: people want to appear in search results.


I make sure to join projects that scrape Google/Amazon en masse, just for the satisfaction.



