
This repo may be helpful to you: https://github.com/satwikkansal/wtfpython

It’s a collection of Python snippets explaining behaviors that may be considered surprising or unexpected.
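For instance, here's the kind of gotcha it documents (my own minimal example, not taken verbatim from the repo): Python evaluates default argument values once, at function definition time, so a mutable default is shared across calls.

    # A classic gotcha: the default list is created once and shared
    # across every call that doesn't pass its own bucket.
    def append_item(item, bucket=[]):
        bucket.append(item)
        return bucket

    print(append_item(1))  # [1]
    print(append_item(2))  # [1, 2] -- the same list again, not a fresh one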


Thank you


I disagree with this article in so many ways. I have a CS degree and I work with many people who don’t, and they are just as good as (or even better than) I am at every point the article raises.

They meet deadlines, they are incredibly good at communication and collaboration, and they have pretty good networks. Most of these traits come from the fact that they had to develop them in order to succeed at learning on their own.

It’s a pretty limited view of the world to think that only college can teach you this. For some people, immersing themselves in a coding bootcamp means leaving the job they need to survive, in the hope of a better job in the future. I can’t imagine how being in college teaches more about meeting deadlines, teamwork, communication, and perseverance than that.

I wish this article provided more facts to back up its claims.


Ngrok is just awesome! A huge shout-out to the developers!


Crawl-delay is not part of the standard robots.txt protocol, and according to Wikipedia, some bots interpret the value differently. That may be why many websites don't even bother defining rate limits in robots.txt.
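If you do want to honor it when a site sets it, the standard library can read it for you. A minimal sketch, assuming Python 3.6+ (where crawl_delay() was added) and a hypothetical target URL:

    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser("https://example.com/robots.txt")
    rp.read()

    # Returns the Crawl-delay value for the given user agent,
    # or None if the site doesn't set one.
    print(rp.crawl_delay("*"))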


I was referring to an actual rate limit, not crawl-delay. For example, YouTube is pretty strict about rate limits:

http://www.bing.com/search?q=%22We+have+been+receiving+a+lar...

I agree that crawl-delay is rare, and when it is set, it's often so long that fully crawling the site becomes impossible -- as if the webmaster configured it 10 years ago and never updated it as the site got faster and bigger.


That's kind of what Scrapy's AutoThrottle extension does.
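Enabling it is just a few lines in the project settings -- a sketch, where the numbers are assumptions you'd tune per target site:

    # settings.py
    AUTOTHROTTLE_ENABLED = True
    AUTOTHROTTLE_START_DELAY = 5        # initial download delay, in seconds
    AUTOTHROTTLE_MAX_DELAY = 60         # upper bound when the site slows down
    AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # average requests in flight per remote server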


Yeah, but that's not just because of web scraping. Plagiarism has been an issue for centuries.


Scrapy is asynchronous, but it provides many settings you can use to avoid DDoSing a website, such as limiting the number of concurrent requests per domain or IP address.

And yes, crawling politely requires a bit of effort from both ends: the crawler and the website.
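On the crawler's side, a polite baseline in Scrapy could look like this (a sketch; the values are assumptions, not recommendations):

    # settings.py
    ROBOTSTXT_OBEY = True               # respect the site's robots.txt
    CONCURRENT_REQUESTS_PER_DOMAIN = 2  # cap simultaneous requests per domain
    DOWNLOAD_DELAY = 1.0                # seconds to wait between requests to the same site
    # If non-zero, concurrency is capped per IP instead of per domain:
    # CONCURRENT_REQUESTS_PER_IP = 2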


Scrapinghub has been 100% remote since day one. Nowadays there are ~140 people spread around the world, covering almost every timezone.


Hey! I work for Scrapinghub. Feel free to ask any questions.


Hey, Valdir from Scrapinghub here! Feel free to ask any questions you might have about the platform.

