
This repo may be helpful to you: https://github.com/satwikkansal/wtfpython

It’s a collection of Python snippets explaining behaviors that may be considered surprising or unexpected.
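For instance, here's the kind of gotcha it documents (my own minimal example, not taken verbatim from the repo): Python evaluates default argument values once, at function definition time, so a mutable default is shared across calls.

    # A classic gotcha: the default list is created once and shared
    # across every call that doesn't pass its own bucket.
    def append_item(item, bucket=[]):
        bucket.append(item)
        return bucket

    print(append_item(1))  # [1]
    print(append_item(2))  # [1, 2] -- the same list again, not a fresh one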


Thank you


I disagree with this article in so many ways. I have a CS degree and I work with many people who don’t, and they are just as good as (or even better than) I am at every point the article raises.

They meet deadlines, they are incredibly good at communication and collaboration, and they have pretty good networks. Most of these traits come from the fact that they had to develop them in order to succeed at learning on their own.

It’s a pretty limited view of the world to think that only college can teach you this. For some people, immersing themselves in a coding bootcamp means leaving the job they need to survive, in the hope of a better job in the future. I can’t imagine how being in college teaches more about meeting deadlines, teamwork, communication, and perseverance than that.

I wish this article provided more facts to back up its claims.


Ngrok is just awesome! A huge shout-out to the developers!


Crawl-delay is not part of the standard robots.txt protocol, and according to Wikipedia, some bots interpret the value differently. That may be why many websites don't even bother defining rate limits in robots.txt.
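If you do want to honor it when a site sets it, the standard library can read it for you. A minimal sketch, assuming Python 3.6+ (where crawl_delay() was added) and a hypothetical target URL:

    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser("https://example.com/robots.txt")
    rp.read()

    # Returns the Crawl-delay value for the given user agent,
    # or None if the site doesn't set one.
    print(rp.crawl_delay("*"))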


I was referring to an actual rate limit, not crawl-delay. For example, YouTube is pretty strict about rate limits:

http://www.bing.com/search?q=%22We+have+been+receiving+a+lar...

I agree that crawl-delay is rare, and when it is set, it's often so long that fully crawling the site becomes impossible -- as if the webmaster configured it 10 years ago and never updated it as the site got faster and bigger.


That's kind of what Scrapy's AutoThrottle extension does.
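Enabling it is just a few lines in the project settings -- a sketch, where the numbers are assumptions you'd tune per target site:

    # settings.py
    AUTOTHROTTLE_ENABLED = True
    AUTOTHROTTLE_START_DELAY = 5        # initial download delay, in seconds
    AUTOTHROTTLE_MAX_DELAY = 60         # upper bound when the site slows down
    AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # average requests in flight per remote server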


Yeah, but that's not just because of web scraping. Plagiarism has been an issue for centuries.


Scrapy is asynchronous, but it provides many settings you can use to avoid DDoSing a website, such as limiting the number of concurrent requests per domain or IP address.

And yes, crawling politely requires a bit of effort from both ends: the crawler and the website.
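On the crawler's side, a polite baseline in Scrapy could look like this (a sketch; the values are assumptions, not recommendations):

    # settings.py
    ROBOTSTXT_OBEY = True               # respect the site's robots.txt
    CONCURRENT_REQUESTS_PER_DOMAIN = 2  # cap simultaneous requests per domain
    DOWNLOAD_DELAY = 1.0                # seconds to wait between requests to the same site
    # If non-zero, concurrency is capped per IP instead of per domain:
    # CONCURRENT_REQUESTS_PER_IP = 2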


Scrapinghub has been 100% remote since day one. Nowadays there are ~140 people spread around the world, covering almost every timezone.


Hey! I work for Scrapinghub. Feel free to ask any questions.


Hey, Valdir from Scrapinghub here! Feel free to ask any questions you might have about the platform.

