
Thousands of systems, from Google to script kiddies to OpenAI to Nigerian call scammers to cybersecurity firms, actively watch the certificate transparency logs for exactly this reason. Yawn.




For those who have never looked at the CT logs: https://crt.sh/?q=ycombinator.com

(the site may occasionally fail to load)
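
crt.sh also has a JSON output mode if you want to consume it programmatically. A minimal sketch (the `%.` wildcard and the `output=json` parameter are crt.sh conventions; as noted above, the endpoint can be slow or fail under load):

    import json
    import urllib.request

    # %25 is the URL-encoded "%" wildcard, so this matches *.ycombinator.com
    url = "https://crt.sh/?q=%25.ycombinator.com&output=json"
    with urllib.request.urlopen(url, timeout=30) as resp:
        entries = json.load(resp)

    # name_value holds newline-separated names for each logged (pre)certificate
    names = {n for e in entries for n in e["name_value"].splitlines()}
    for name in sorted(names):
        print(name)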


Shameless plug :)

https://www.merklemap.com/search?query=ycombinator.com&page=...

Entries are indexed by subdomain instead of by certificate (click an entry to see all certificates for that subdomain).

Also, you can search for any substring (making that fast enough across almost 5B entries was quite the journey to implement):

https://www.merklemap.com/search?query=ycombi&page=0


Not 100% related but not 100% unrelated either: I've got a script that generates variations of the domain names I use the most: all the most common typos/misspellings, all the "1337" variations, everything at Levenshtein edit distance 1, quite a few at distance 2, etc.

For example for "lillybank.com", I'll generate:

    llllybank.com
    liliybank.com
    ...
and countless others.

Hundreds of thousands of entries. They are then null-routed from my unbound DNS resolver.
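
A minimal sketch of the variant generation (edit-distance-1 only, leaving out the "1337" and distance-2 variants; the domain is the example above, and `local-zone: ... always_nxdomain` is unbound's directive for answering NXDOMAIN locally):

    import string

    def distance_one(label):
        # Yield all Levenshtein-distance-1 variants of a domain label.
        alphabet = string.ascii_lowercase + string.digits + "-"
        for i in range(len(label) + 1):          # insertions
            for c in alphabet:
                yield label[:i] + c + label[i:]
        for i in range(len(label)):
            yield label[:i] + label[i + 1:]      # deletions
            for c in alphabet:                   # substitutions
                if c != label[i]:
                    yield label[:i] + c + label[i + 1:]

    # Emit unbound config null-routing every variant of lillybank.com
    for variant in sorted(set(distance_one("lillybank"))):
        print(f'local-zone: "{variant}.com." always_nxdomain')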

My browsers are forced into "corporate" settings where they cannot use DoH/DoT: everything between my browsers and my unbound resolver travels in the clear.

All DNS UDP traffic that contains any Unicode domain name is blocked by the firewall. No DNS over TCP is allowed (and, no, I don't care).

I also block entire countries' TLDs as well as entire countries' IP blocks.

I've been running a setup like that (plus many killfiles, plus DNS resolvers known to block all known porn and malware sites, etc.) for years now. The Internet keeps working fine.


The first page of results doesn't include ycombinator.com. I get `app.baby-ycombinator.com`, `ycombinator.comchat.com`, and everything in between.

Substring doesn't seem like what I'd want in a subdomain search.


> Substring doesn't seem like what I'd want in a subdomain search.

Well, if you want only subdomains search for *.ycombinator.com.

https://www.merklemap.com/search?query=*.ycombinator.com&pag...


Any insights you can share on how you made search so fast? What kind of resources does it take to implement it?

Most of MerkleMap is stored on ZeroFS [0], which lets IO resources scale quite crazily :)

[0] https://github.com/Barre/ZeroFS


How does ZeroFS handle consistency with writes?

If you use 9P or NBD, it handles fsync as expected. With NFS, it's time-based.

https://github.com/Barre/ZeroFS#9p-recommended-for-better-pe...


Oh awesome! I was searching for consistency, but I guess durability is the word used for filesystems. Thanks!

> Watch Ubuntu boot from ZeroFS

Love it


Thank you!!! Needed exactly this at work.

Glad it was helpful!

Considering how it must be getting hammered what with the "AI" nonsense, it's interesting how crt.sh continues to remain usable, particularly the (limited) direct PostgreSQL db access.

To me, this is evidence that SQL databases with high traffic can be made directly accessible on the public internet

crt.sh seems to be more accessible at certain times of the day. I can remember when it had no such accessibility issues


It is not usable.

It's the only website I know of where queries can just randomly fail for no reason, and they don't even have an automatic retry mechanism. Even the worst enterprise nightmares I've seen weren't this user-unfriendly.


With that said, given that (1) pre-certificates in the log are big and (2) lifetimes are shortening and so there will be a lot of duplicates, it seems like it would be good for someone to make a feed that was just new domain names.

There's an extension to static-ct-api, currently implemented by Sunlight logs, that provides a feed of just SANs and CNs: https://github.com/FiloSottile/sunlight/blob/main/names-tile...

For example:

  curl https://tuscolo2026h1.skylight.geomys.org/tile/names/000 | gunzip
(It doesn't deduplicate if the same domain name appears in multiple certificates, but it's still a substantial reduction in bandwidth compared to serving the entire (pre)certificate.)
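
Since the tile body appears to be gzip-compressed, newline-separated names (which is what the `| gunzip` above implies), deduplicating client-side is cheap. A sketch, reusing the tile URL from the example:

    import gzip
    import urllib.request

    # Tile path taken from the curl example above; tiles are numbered
    url = "https://tuscolo2026h1.skylight.geomys.org/tile/names/000"
    with urllib.request.urlopen(url, timeout=30) as resp:
        body = gzip.decompress(resp.read())

    # Drop names that appear in more than one certificate
    for name in sorted(set(body.decode().splitlines())):
        print(name)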


These exist for apex domains; the real use-case is subdomains.

Sure, but the subdomains will be duplicated for the same reasons.

The Internet Archive also uses the Certificate Transparency logs; some websites that aren't linked anywhere end up in the Wayback Machine this way: https://archive.org/details/certificate-transparency?tab=abo...

"... for exacty this reason."

Needs clarification. What reason?


Knowing what DNS names are actually used.

EDIT: that's the flip side of supporting HTTPS that's not well-known among developers: by acquiring a legitimate certificate for your service to enable HTTPS, you also announce to the entire world, through a public log, that your service exists.


I don't really see how this is a flip-side. If you're putting something on the web, presumably you want it to be accessed by others, so this is actually a benefit.

If you didn't want others to access your service, maybe consider putting it in a private space.


There are uses of HTTPS that don't overlap with "the (public) web".

All of the internal stuff at $employer uses a private CA. I suspect this is fairly universal at places that aren't super tiny.

s/exacty/exactly

"I minted a new TLS cert and it seems that OpenAI is scraping CT logs for what I assume are things to scrape from, based on the near instant response from this:"

The reason presented by the blog post is "for what I assume are things to scrape from"

Putting aside the "assume" part (see below^1), is this also the reason that the other "systems" are "scraping" CT logs

After OpenAI "scrapes", what does OpenAI do with the data (readers can guess)

But what about all the other "systems", i.e., parties that may use CT logs. If the logs are public then that's potentially a lot of different parties

Imagine, in an age before the internet, telephone subscriber X sets up a new telephone line; the number is listed in a local telephone directory ("the phone book"), and X immediately receives a phone call from telephone subscriber Z^2

X then writes an op-ed that suggests Z is using the phone book "for who to call"

This is only interesting if X explains why Z was calling or if the reader can guess why Z was calling

Anyone can use the phone book, anyone can use ICANN DNS, anyone can use CT logs, etc.

Why does someone use these public resources. Online commenter: "To look up names and numbers"

Correct. But that alone is not very interesting. Why are they looking up the names and numbers

1.

We can make assumptions about why someone is using a public resource, i.e., what they will use the data for. But that's all they are: assumptions

With the telephone, X could ask "Why are you calling?"

With the internet, that's not possible.^3 This leads to speculation and assumptions. Online commenters love to speculate, and often draw conclusions without evidence

No one knows _everything_ that OpenAI does with the data it collects except OpenAI employees. The public only knows about what OpenAI chooses to share

Similarly no one knows what OpenAI will do with the data in the future

One could speculate that it's naive to think that, in the long term, data collected by "AI" companies will only be used for "AI"

2. The telephone service also had the notion of "unlisted numbers", but that's another tangent for discussion

3. Hence, for example, people who do port scans of the IPv4 address space will try to prevent the public from accessing the results by restricting access to "researchers", etc. Getting access always involves contacting the people who hold the scans and explaining what the requester will do with the data. In other words, removing speculation


What's the yawn for?

It implies that this is boring and not article/post-worthy (which I agree with).

Certificate transparency logs are intended to be consumed by others. That is indeed what is happening. Not interesting.


> It implies that this is boring and not article/post-worthy (which I agree with).

It's certainly news to me, and presumably some others, that this exists.


Which part is news?

If certificate transparency is new to you, I feel like there are significantly more interesting articles and conversations that could/should have been submitted instead of "A public log intended for consumption exists, and a company is consuming that log". This post would do literally nothing to enlighten you about CT logs.

If the fact that OpenAI is scraping certificate transparency logs is new and interesting to you, I'd love to know why it is interesting. Perhaps I'm missing something.

Way more interesting reads for people unfamiliar with what certificate transparency is, in my opinion, than this "OpenAI read my CT log" post:

https://googlechrome.github.io/CertificateTransparency/log_s...

https://certificate.transparency.dev/


> I feel like there are significantly more interesting articles

If this is the article that introduces someone to the concept of certificate transparency, then there's nothing wrong with that. Graciously, you followed through with links to what you consider more interesting; that is not something a lot of commenters do, and many would just leave a snarky comment at someone for being one of the lucky 10000 for the day.


Yeah, this is the unspoken part about HTTPS: you enable it, you also announce to the entire world you're serving stuff from specific DNS names :).

(Which is why I hate that it's so hard to test things locally without having to get a domain and a certificate. I don't want to buy domain names and announce them publicly for the sake of some random script that needs to offer an HTTP endpoint.)

Modern security is introducing a lot of unexpected couplings into software systems, including coupling to political, social and physical reality. That's surprising if you think in terms of the programs you write, which most likely shouldn't have any such relationships.

My favorite example of such unexpected coupling, whose failures are still regularly experienced by users, is wall clock time. If your program touches anything related to certificates, even indirectly, suddenly it's coupled to the actual real-world clock, and your users had better make sure their system time is in sync with the rest of the world, or else things will stop working.


You do know that /etc/hosts is a file you can edit, right? You hopefully also know that you can create your own certificate authority or self-signed certificates and add them to your CA store.
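
For the self-signed route, a minimal sketch with Python's `cryptography` package (the hostname `myapp.local` is illustrative; point it at 127.0.0.1 in /etc/hosts and add cert.pem to your trust store, and nothing ever touches a public CT log):

    import datetime
    from cryptography import x509
    from cryptography.hazmat.primitives import hashes, serialization
    from cryptography.hazmat.primitives.asymmetric import rsa
    from cryptography.x509.oid import NameOID

    host = "myapp.local"  # illustrative name, resolved via /etc/hosts
    key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
    name = x509.Name([x509.NameAttribute(NameOID.COMMON_NAME, host)])
    now = datetime.datetime.now(datetime.timezone.utc)

    cert = (
        x509.CertificateBuilder()
        .subject_name(name)
        .issuer_name(name)  # self-signed: issuer == subject
        .public_key(key.public_key())
        .serial_number(x509.random_serial_number())
        .not_valid_before(now)
        .not_valid_after(now + datetime.timedelta(days=365))
        .add_extension(x509.SubjectAlternativeName([x509.DNSName(host)]),
                       critical=False)
        .sign(key, hashes.SHA256())
    )

    with open("cert.pem", "wb") as f:
        f.write(cert.public_bytes(serialization.Encoding.PEM))
    with open("key.pem", "wb") as f:
        f.write(key.private_bytes(serialization.Encoding.PEM,
                                  serialization.PrivateFormat.PKCS8,
                                  serialization.NoEncryption()))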

> Certificate transparency logs are intended to be consumed by others. That is indeed what is happening. Not interesting

Oh, I read this as indicating OpenAI may make a move into the security space.


Even if it's just for their internal security initiatives, it would make sense given how massive they are. Threat hunting via cert monitoring is very effective.

But it isn't. The guy posted the fact that they sent a bot to scrape it.

That’s not the intended use for CT logs.


Presumably this is well-known among people that already know about this.

P.S. In the hopes of making this more than just a sarcastic comment, the question of "How do people bootstrap knowledge?" is kind of interesting. [1]

> To tackle a hard problem, it is often wise to reuse and recombine existing knowledge. Such an ability to bootstrap enables us to grow rich mental concepts despite limited cognitive resources. Here we present a computational model of conceptual bootstrapping. This model uses a dynamic conceptual repertoire that can cache and later reuse elements of earlier insights in principled ways, modelling learning as a series of compositional generalizations. This model predicts systematically different learned concepts when the same evidence is processed in different orders, without any extra assumptions about previous beliefs or background knowledge. Across four behavioural experiments (total n = 570), we demonstrate strong curriculum-order and conceptual garden-pathing effects that closely resemble our model predictions and differ from those of alternative accounts. Taken together, this work offers a computational account of how past experiences shape future conceptual discoveries and showcases the importance of curriculum design in human inductive concept inferences.

[1]: https://www.nature.com/articles/s41562-023-01719-1


Because it's hardly news in its context.

What reason?

The CT log tells you about new websites as soon as they come online. Good if you're intending to scrape the web.
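
For example, tailing new names as they are logged. A sketch with the `certstream` Python package; the websocket URL is an assumption (the original calidog.io endpoint has been discontinued), so point it at any certstream-compatible server you run or trust:

    import certstream

    def on_message(message, context):
        # Each update carries the SANs of a newly logged (pre)certificate
        if message["message_type"] == "certificate_update":
            for domain in message["data"]["leaf_cert"]["all_domains"]:
                print(domain)

    # Assumed certstream-compatible endpoint; substitute your own
    certstream.listen_for_events(on_message, url="wss://certstream.calidog.io/")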

[flagged]


The intended purpose of certificate transparency logs is to be viewed by others!

Perhaps you should save your "gross" judgement for when you better understand what's happening?


You are implying that a law is being broken, but isn't this the equivalent of going to city hall to pull public land records?

The whole point of the CT logs is to be a public list of all domains which have TLS certs issued by the Web PKI. People are reading this list. I really don't see what is either surprising or in any way problematic in doing so.

The whole point of CT logs is to make issuance of certificates in the public WebPKI… public.

Certificate Transparency is a Google project. They don't need to scrape it; they host all the data. It's one of those projects Google runs because it thinks it genuinely improves the internet, by reducing certificate authority abuse.


