Hacker News | AnbeSivam's comments

Do you know of any blog posts or papers that talk about this: using topology for such interval data types?



I came across someone else mentioning a similar bandwidth constraint w.r.t. LFBs (line fill buffers) per core a month back.

https://news.ycombinator.com/item?id=25221968


Does your spider face the issues with Cloudflare mentioned by the Gigablast founder here?

https://www.gigablast.com/blog.html


We are not aware that we have any problem like that.

This might explain why GigaBlast has a problem:

Because of bugs in the original Gigablast spidering code, the Findx crawler ended up on a blacklist in Project Honeypot as being “badly behaved” (fixed in our fork). That meant quite a bit of trouble for us because CDN providers, which are very powerful hubs for internet traffic, put a lot of weight on this blacklist. Some of the most popular websites and services on the internet run through services like Cloudflare and other CDNs – so if you are in bad standing with them, suddenly a large part of the internet is not available, and we weren’t able to index it.

extract from: https://web.archive.org/web/20190921180535/https://privacore...


> fixed in our fork

Does this mean your spider is a fork of Gigablast? Is there some additional interesting technical information about how your code/infrastructure is set up?


I realise this is not addressing your second question but you might find it interesting. The post below covers our server expansion one year ago. We are adding another 100 servers over Christmas and early in the new year.

https://blog.mojeek.com/2019/12/100-server-build-and-install...

We'll be writing about our tech stack in part 3 of our 4-part FAQ series; this is part 1:

https://blog.mojeek.com/2020/11/frequently-asked-questions-a...


No, we have our own spider.


The post was from Findx, not Mojeek.


Any insight as to what is considered a "badly behaved crawler"? Or is it something that you work out with the CDN so they don't blacklist your IP?


Hello, I work on the technical side of Mojeek.

Mojeek follows the robots.txt protocol so if a site doesn't want to be crawled by MojeekBot we respect that wish. There is also a generous crawl delay between pages on the same host.

Generally a 'badly behaved bot' will ignore robots.txt or hit a site too hard with requests.

Our bot uses a specific user agent which you can verify via DNS. https://www.mojeek.com/bot.html
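
A DNS check of this kind is commonly done as a reverse lookup followed by a forward confirmation. Below is a minimal Python sketch of that idea, not Mojeek's own tooling. It assumes genuine crawler hosts resolve under `.mojeek.com` (an assumption here; the bot page linked above is the authority), and `is_verified_bot` is a hypothetical helper name.

```python
import socket

# Assumption for illustration: genuine crawler hosts resolve under this suffix.
ALLOWED_SUFFIXES = (".mojeek.com",)


def _reverse(ip):
    """Reverse DNS (PTR) lookup: IP address -> hostname."""
    return socket.gethostbyaddr(ip)[0]


def _forward(hostname):
    """Forward DNS lookup: hostname -> list of IPv4 addresses."""
    return socket.gethostbyname_ex(hostname)[2]


def is_verified_bot(ip, suffixes=ALLOWED_SUFFIXES, reverse=_reverse, forward=_forward):
    """Return True if `ip` reverse-resolves to an allowed host that maps back to `ip`."""
    try:
        hostname = reverse(ip).rstrip(".").lower()
        if not hostname.endswith(tuple(suffixes)):
            return False
        # Forward-confirm, since anyone can publish a PTR record for their own IP.
        return ip in forward(hostname)
    except OSError:
        # Lookup failed (no PTR record, unknown host, ...): treat as unverified.
        return False
```

The resolvers are passed in as parameters so the logic can be exercised without network access. A user-agent string alone proves nothing, which is why the forward confirmation step matters.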


> There is also a generous crawl delay between pages on the same host.

What's the order of magnitude of this delay? milliseconds? hundreds of milliseconds? seconds? I'm curious what's considered 'polite' in this realm and how the various parties come to form opinions on this.


A minimum of 4 seconds.
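
For illustration, a per-host minimum delay like this can be tracked with a small table of "next allowed fetch" times. This is a sketch under assumptions, not MojeekBot's implementation: the class name `HostThrottle` and its methods are hypothetical, and only the 4-second figure comes from the comment above.

```python
import time
from urllib.parse import urlsplit

MIN_DELAY_SECONDS = 4.0  # the minimum quoted in the thread


class HostThrottle:
    """Tracks when each host may next be fetched."""

    def __init__(self, min_delay=MIN_DELAY_SECONDS, clock=time.monotonic):
        self.min_delay = min_delay
        self.clock = clock
        self._next_allowed = {}  # host -> earliest time of the next request

    def wait_time(self, url):
        """Seconds to wait before `url` may be fetched (0.0 if allowed now)."""
        host = urlsplit(url).hostname or ""
        next_allowed = self._next_allowed.get(host, float("-inf"))
        return max(0.0, next_allowed - self.clock())

    def record_fetch(self, url):
        """Note that `url` was just fetched, starting that host's delay."""
        host = urlsplit(url).hostname or ""
        self._next_allowed[host] = self.clock() + self.min_delay
```

A crawler loop would call `wait_time(url)` before each request, sleep or reschedule if it is non-zero, and call `record_fetch(url)` afterwards. Keying on the host means pages on different hosts do not delay each other.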


I just had a look and there's a non-standard "crawl-delay directive" extension to robots.txt that can be used to ask a spider to take some time between page visits:

  User-agent: bingbot
  Allow: /
  Crawl-delay: 10

https://en.wikipedia.org/wiki/Robots_exclusion_standard#Craw...
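
As a rough sketch of how a crawler could read that directive, Python's standard `urllib.robotparser` exposes it through `crawl_delay()`. The helper name `requested_delay` and the `floor` parameter are made up for this example; the 4-second floor is simply the figure mentioned earlier in the thread, and nothing here describes how any particular bot behaves.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: bingbot
Allow: /
Crawl-delay: 10
"""


def requested_delay(robots_txt, user_agent, floor=4.0):
    """Return the delay in seconds to use for `user_agent`.

    `floor` is the crawler's own minimum; a larger Crawl-delay found in
    robots.txt overrides it.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    delay = parser.crawl_delay(user_agent)  # None if no directive applies
    if delay is None:
        return floor
    return max(floor, float(delay))
```

Calling `requested_delay(ROBOTS_TXT, "bingbot")` picks up the site's requested delay, while an agent with no matching rule falls back to the floor.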


Hello, MojeekBot doesn't observe the crawl-delay directive but thanks for the reminder of it as it's beneficial for us to know if site owners require more grace between requests.


Hey. Good job with Mojeek. It seems the crawl-delay directive is not part of the robots.txt standard. It probably should be but that's not up to you!


Thanks for that link, I haven't come across it before.


To anyone following IR-related news: do you know what happened to BitFunnel (the open-sourced rewrite of the Bing search engine)?

https://bitfunnel.org/categories/blog/

https://github.com/bitfunnel/nativejit/

https://github.com/BitFunnel/BitFunnel


I wasn't familiar with this project. Thanks for mentioning it.




Thanks, I was going to ask what is the back story here.


From the article -

> They claimed 10x the density of DRAM, it is now 4x

> Latency missed by 100x, yes one hundred times, on their claim of 1000x faster, 10x is now promised

> More troubling is endurance, probably the main selling point of this technology over NAND. Again the claim was a 1000x improvement, Intel delivered 1/333rd of that with 3x the endurance.

Going by this seminar from a few months back (https://www.youtube.com/watch?v=hXurTRtmfWc), I think density can be increased, since this is only the initial product, and latency is contributed more by PCIe/OS/application overhead than by the underlying 3D XPoint material. The slides from the article are for the PCIe SSDs; I wonder whether the earlier claimed latency still holds for NVRAM.

I wonder why the endurance is so much lower than the earlier claims.
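
The shortfall figures in the quoted article follow from dividing each claimed improvement by the delivered one. A small Python check, using only the numbers quoted above (`shortfall` is a name chosen for this example):

```python
# (claimed improvement, delivered improvement) as multiples, from the quotes above
claims = {
    "density": (10, 4),
    "latency": (1000, 10),
    "endurance": (1000, 3),
}


def shortfall(claimed, delivered):
    """How many times smaller the delivered improvement is than the claim."""
    return claimed / delivered


for name, (claimed, delivered) in claims.items():
    factor = shortfall(claimed, delivered)
    print(f"{name}: claimed {claimed}x, delivered {delivered}x, missed by {factor:.1f}x")
```

This reproduces the article's "missed by 100x" for latency and "1/333rd" for endurance, and puts the density shortfall at 2.5x.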


Ouch!

10x latency and 3x endurance might normally satisfy the "must be 10x better" criteria to break into an existing market, but with the maturity of flash, and how memory hierarchies can ameliorate useful sets of latency requirements, this could end up being a damp squib instead of the revolution promised. 1000x endurance would have been great, 3x, who will notice?

Not the first time Intel has grossly mismanaged its technology....


"memory hierarchies can ameliorate useful sets of latency requirements"

Not the latency for commits to stable non-volatile storage, unless battery backed RAM is an option.
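
To make the point concrete: a cache can hide read latency, but a commit is only durable once the device has acknowledged the write, so the program has to wait for that round trip. A minimal Python sketch that times it (`timed_durable_write` is a hypothetical helper; whether `fsync` really reaches non-volatile media depends on the OS, filesystem, and drive):

```python
import os
import time


def timed_durable_write(path, data):
    """Write `data` to `path`, force it to stable storage, return elapsed seconds."""
    start = time.perf_counter()
    with open(path, "wb") as f:
        f.write(data)
        f.flush()             # push Python's buffer to the OS page cache
        os.fsync(f.fileno())  # ask the OS to push the page cache to the device
    return time.perf_counter() - start
```

Dropping the `os.fsync` call typically makes the write return much sooner, but then a power loss can discard data the application believed was committed.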


Or just a big enough capacitor to finish the necessary writes to flash memory. Which as I understand it is one of the things that distinguishes enterprise from consumer flash drives, and one of the reasons I use the slowest, smallest Intel enterprise flash drive for system and /home.


It seems to be a case of transitioning marketing claims from those about the potential of the core underlying technology to more real world scenario benefits. Some of the numbers included latency in the kernel/driver, so they are more focused on actual applications.

An initial shipped product is a bit different from the technology's potential. We've been waiting on Zen, err, Bulldozer/Excavator/Piledriver/Steamroller for years now, and while mobile parts and APUs shipped, it has been a fluke in the server and desktop markets.


While certainly not breathtaking, it's an initial product on a development path. I was expecting more, but this is still an improvement.



