To spell it out further: if the two processed outputs from different accounts are bit-by-bit identical then it is impossible to tell them apart. I.e., they cannot contain data that identifies you personally.
They might still contain data that groups those multiple accounts together, though. E.g. everyone at X university gets the same watermark.
I guess I should have expanded: a watermark is useful to narrow down who could have released the file. Over time, as more information is gained, you get closer to a singleton (the result of the intersection of sets). Whether that takes 1 or 10 watermarks does not make a real difference over time, especially when specialized websites release thousands of such PDFs.
So if you have a watermark that associates files with the time they were downloaded, and you download the file from multiple networks etc. at the same time and remove the diff between these versions, you'll still have the time-based watermark in the resulting file, so you'll leak when you downloaded it.
Problem is you only have to fail once to leak information, and there are only a finite number of bits of information you can leak before your identity is known.
There is still the metadata the operating system stores with the file, even when the file is 1:1 identical. When you use Apple AirDrop to send a file to a friend, the OS remembers who sent the file to your friend. Now if the police ever look at your friend's computer they can see who this file came from. Or if your friend saves it onto a USB drive formatted with the Apple File System, the metadata is still preserved.
Then your tool is incomplete. Steganography works by encoding information into side channels, like precise positioning of words in the PDF, fonts, etc...
Your tool can be as crude as running the document through pdftotext and only keeping the text output. It can just throw away all these possible side channels that are not relevant to the actual content.
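A minimal sketch of that crude approach, assuming the poppler-utils pdftotext binary is installed (the file name is made up):

    # Keep only the extracted text; layout, fonts, kerning tricks and metadata
    # streams are all thrown away by the text extraction step.
    import subprocess
    import sys

    def extract_plain_text(pdf_path: str) -> str:
        # "-" makes pdftotext write the extracted text to stdout
        result = subprocess.run(["pdftotext", pdf_path, "-"],
                                capture_output=True, text=True, check=True)
        return result.stdout

    if __name__ == "__main__":
        print(extract_plain_text(sys.argv[1]))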
Others mentioned other, more sophisticated side channels, such as rewording sentences, but I'm sure authors would not welcome that.
Even if I don't like the practice of charging so much for scientific papers, I must say that watermarks are the best form of DRM. You own the file, you can print it, you can save it. That's a stark difference from most schemes going around these days.
Calibre with the DeDRM plugin can remove the encryption and DRM. Did this a few years ago for my undergrad, it works well. Just import the PDF into Calibre and Calibre will handle it from there (as long as you have the DeDRM plugin).
It depends on the encryption being used. If it comes from Adobe Digital Editions, then the DeDRM plugin needs your credentials to decrypt it and convert the file to an unencrypted one. It needs your credentials since ADE ties the encryption to them, so the plugin needs them to get the key to decrypt the file.
The first setup takes a while, but once it is properly set up, it takes seconds to decrypt encrypted/DRMed PDFs. You can even print the file without worrying that it will be a garbled mess.
Watermarks don't even have to be visible. They could do things that survive re-encoding but are too subtle to see. For example adding a few slightly off color pixels to random letters on the page.
He means you control the file, and have the ability to copy it and open it on any device, and the legal "owner" of the file doesn't have the power to stop you by any common means.
The gold standard for academic watermark removal used to be (and maybe still is?) https://github.com/kanzure/pdfparanoia . Not sure if it edits Elsevier's metadata though.
This is a very basic tool. It may work against this particular scheme by Elsevier if it's not too sophisticated. A few years ago I was working for another content provider on marking their PDFs; my job was to make sure the watermarks were everywhere and that removing them was difficult while remaining practically transparent to the user. We also employed steganography etc. so that rendering the PDFs as images and combining them again into a PDF would preserve the watermark. We were very clear with users that the watermarks were there and, as far as I can tell, not a single copy protected in this way was uploaded to the public Internet.
Here's the story. A mid-sized publisher contacted me about a problem they had: someone was distributing their PDFs online using various services, but mainly Scribd. At that time (it was maybe 2012?) it worked like this: the attacker released a document, notified his group, and they started downloading. The publisher immediately filed a DMCA copyright infringement notification and it was handled within 3-4 days - more than enough time for people to download the "release". It went on for a long time and Scribd refused to remove the attacker's account.[0] After a couple of months they finally gave in and deleted it, but of course a new account was immediately created. It was not so much a cat-and-mouse game as a mouse-and-turtle one.
So when they asked for my help I told them: you don't have a copyright problem, you have an SEO problem - people looking for their publications could easily find them online, with the stolen copies sometimes appearing higher on the SERP than their own pages. They said they would address that, and they somehow came to terms with the fact that the past documents were lost, but they wanted to avoid this situation in the future.
I checked some of their publications. They were moderately priced and the content was quite interesting. Many of their customers were very supportive - actually they were sending reports when they noticed the pirated copies. And they weren't saying "Why should I pay for your books when I can get them online?" but rather "Please take care of protecting your content since we want you to survive and publish more books". So, to answer your question, I was very happy to work for them, and I would do it again.
[0] I was told they changed their approach later and started to collaborate with publishers.
As a customer I want a thriving, commercially viable, decentralised publishing industry. I don't want DRM encumbered formats, or a single vendor, or a point of failure that prevents me from reading books I've purchased.
If my email address embedded in the PDF enables that then I think that's reasonably balanced ethics arithmetic.
(edit - I consider public-funded scholarly research to be a different matter to private purchases of commercial books such as fiction or trade textbooks)
I would just render each page of such a PDF in two colors (black or white without shades) and turn the images into a new PDF to upload. If that doesn’t make the steganography fail, maybe add a lot of static? Or censor illustrations anyway since many journals provide them independently of the text.
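A minimal sketch of that re-rasterization idea, assuming the pdf2image and Pillow packages (plus poppler) are available; file names are hypothetical:

    # Render every page to a 1-bit black-and-white bitmap and reassemble the
    # bitmaps into a fresh PDF, discarding the original page description.
    from pdf2image import convert_from_path

    def rasterize_to_bilevel_pdf(src: str, dst: str, dpi: int = 300) -> None:
        pages = convert_from_path(src, dpi=dpi)      # render each page to an image
        bilevel = [p.convert("1") for p in pages]    # black or white, no shades
        bilevel[0].save(dst, save_all=True, append_images=bilevel[1:])

    rasterize_to_bilevel_pdf("watermarked.pdf", "rasterized.pdf")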
"I would just" [do this slightly more complicated transformation] really misses the point. Nobody disputes the fact that you could turn any given human-readable pdf into an untraceable version. E.g. you could copy it all out longhand and make sketches of all the plots. Certainly it would work. The point is that as the necessary transformation becomes more complicated/more work, the fraction of the population that will actually do it will vanish, and the surveillance scheme will be effective again.
It wouldn't work since we were allowed to manipulate certain aspects of text, too, to a certain extent, in a way that looked pretty much statistically random. That is, by comparing several copies of the same file, the bits were very much different, and the differences were distributed over the whole document. One of the main aims was to make sure the watermark gets preserved over various transformations, and we automatically tested each new document to make sure it works.
You could even just move a few words around, depending on the language.
Depending on the language, you could even just move a few words around.
Do that randomly, combine the products, and you might get enough entropy to create unique fingerprints for each download.
Randomly do that, combine the products, and you might get enough entropy to create unique fingerprints for each download.
(This silly example can create 4 unique fingerprints)
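As a toy sketch of why that gives 4 fingerprints: two sentences with two equivalent orderings each yield 2 x 2 combinations.

    # Enumerate every combination of the two reworded sentences above.
    from itertools import product

    variants = [
        ("You could even just move a few words around, depending on the language.",
         "Depending on the language, you could even just move a few words around."),
        ("Do that randomly, combine the products, and you might get enough entropy.",
         "Randomly do that, combine the products, and you might get enough entropy."),
    ]

    for n, combo in enumerate(product(*variants)):
        print(f"fingerprint {n:02b}:", " ".join(combo))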
When I write, I put a great deal of thought into how to arrange sentences for maximum clarity or effectiveness. I would not appreciate an eBook service messing with that, even if the meaning was unchanged.
In the most extreme case, imagine if this was a book of poetry.
For PDF you can do this in a much more subtle way. In a typical block of text every individual letter comes with its own kerning adjustment. You can adjust those in a way that's invisible to the reader but still allows fingerprinting. There are probably 1000 different options too - don't think of moving words as in swapping positions in a sentence. (I know the parent suggested it, but that's silly)
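A conceptual sketch of the kerning idea (not the scheme described above, and not tied to any real PDF library): hide one bit in the parity of each per-glyph adjustment, where a change of one unit - a thousandth of the text size - is invisible.

    def embed_bits_in_kerning(kerning: list[int], bits: str) -> list[int]:
        # Hypothetical encoding: even adjustment encodes 0, odd encodes 1;
        # each value moves by at most 1/1000 of the text size.
        out = list(kerning)
        for i, bit in enumerate(bits):
            if i >= len(out):
                break
            out[i] = out[i] - (out[i] % 2) + int(bit)
        return out

    # 8 kerning slots carry one byte; a full page offers far more.
    print(embed_bits_in_kerning([-12, 0, 4, -3, 7, 0, -1, 2], "10110010"))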
Replacing characters with identical-looking Unicode chars, adding extra spaces here and there, adding newlines (and more spaces :)), adding random typos, using a dictionary of "safe" word/phrase replacements, etc. And don't forget about formulas, charts, etc. - a pure text version is not too useful on its own.
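A minimal sketch of two of those tricks, homoglyph swaps and zero-width spaces, each carrying one bit per opportunity (the mapping and payload are illustrative only):

    HOMOGLYPHS = {"a": "\u0430", "e": "\u0435", "o": "\u043e"}  # Cyrillic look-alikes
    ZWSP = "\u200b"  # zero-width space

    def embed(text: str, bits: str) -> str:
        out, i = [], 0
        for ch in text:
            if ch in HOMOGLYPHS and i < len(bits):
                out.append(HOMOGLYPHS[ch] if bits[i] == "1" else ch)
                i += 1
            elif ch == " " and i < len(bits):
                out.append(" " + ZWSP if bits[i] == "1" else " ")
                i += 1
            else:
                out.append(ch)
        return "".join(out)

    marked = embed("the quick brown fox jumps over the lazy dog", "1011001")
    print(marked.encode("unicode_escape"))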
If you deal with fiction and the like where you basically have just text then I think that's correct: it would be trivial to detect the watermarks in various copies by simply comparing them. I was dealing with PDFs containing tables, formulas, illustrations, etc., so a plain-text version would be unusable.
Randomly choose 3 big paragraphs in the ebook and add an extra newline in the middle of each, at the end of a random sentence. This would be my choice if I had to do some kind of invisible watermarking, at least.
Closer to home and a bit more extreme, a few transposed numbers in a scholarly article would be enough to rekindle another autism/vaccine conspiracy theory!
No, this would not work for a couple of reasons. Manipulating the content itself such as changing the order of words is very dangerous as it can influence the meaning, and if you process things at scale it could lead to devastating consequences. But there are many other aspects of text such as kerning and others (a dozen or so in this particular case) that are virtually invisible to the reader but are detectable by a machine. I'd prefer not to get into the details of the implementation here but of course a dedicated team with enough resources could successfully break it after some time - but I believe it wouldn't make any sense economically.
I'm curious what a legal case would look like around steganography. "Trust that our system says that this string is embedded". Or would the prosecution be obliged to divulge the algorithm?
I believe it would be enough to demonstrate the functioning of the system in action - there was a crude UI that you could use to extract the string from the input document, so it would be easy to demonstrate that the PDFs uploaded by the attacker to a given service do in fact contain those strings.
But in this particular case it wasn't even necessary. After a couple of months it turned out that the person who had been uploading the unprotected versions made a mistake and was located as a 20-something living with his parents in a small house on the East Coast. It was enough to notify them and the malicious activity stopped. The company wasn't interested in extracting every penny from the kid (or his poor parents), they just wanted him to stop, and one letter from a lawyer was enough. If they had wanted to go full steam, they would have involved the police and I'm sure they would have found quite a lot of incriminating evidence on his computer, but they were clear that ruining someone else's life was not their aim.
Over here (Poland) almost everyone does something of that sort.
Our ebook market is so fragmented that there's no DRM solution that all (or even most) e-readers work with. If you add in smartphones and car stereos (for audiobooks), the situation gets even worse. Therefore, most publishers use watermarking instead of DRM, usually giving you Epub, PDF and Mobi, which you can read on any device you want.
The most common form of watermarking, at least where epub is concerned, is a 1px by 1px div containing a nonsensical hex or base64-encoded string. There are rumors of watermarks in cover images and even the text itself, though. Apparently, sometimes spaces get removed or extraneous spaces are added, lines get split up a little differently, or some common spelling or typing mistakes are made. Considering the huge number of alteration points you have, let's say 10000 per book, even doing two alterations lets you uniquely watermark on the order of 10000 choose 2, i.e. roughly 50 million copies.
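A quick back-of-the-envelope check of that capacity claim (10000 is just the illustrative number of alteration points from above):

    import math

    points = 10_000
    print(math.comb(points, 2))   # ~5.0e7 distinguishable copies with 2 alterations
    print(math.comb(points, 3))   # ~1.7e11 with 3 alterations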
Similar things are done to audiobooks, whether by modifying the audio itself in imperceptible ways, or by modifying the internal structure of the mp3 files. From what I've heard, messing with how frames are laid out and what's in-between them is a common tactic.
What do they do with them? Are there criminal or civil cases with those watermarks? In the US courts I believe those secret watermarks would be made public at trial, so what’s the point?
It starts with a letter that basically says "we know what you're doing, we're keeping an eye on you, please stop or you'll end up in jail". I don't know what happens afterwards; I know one person who got such a letter and stopped their illicit activities.
If you were serious about piracy and wanted to release books en masse, you'd probably use stolen credit cards, stolen accounts or something of that sort. I don't think that's the goal here, though, those watermarks are mostly for deterring casual piracy, sharing books with friends and so on.
Elsevier could easily make this completely non-trivial. For a very silly example, imagine how this would work if each user got a PDF using a different font. How do you automatically normalize the difference between Comic Sans and Times New Roman? Sure, it's trivial to write a tool that understands what "fonts" are and does this (especially for PDF), but you can't do it with a simple binary tool.
And of course Elsevier can do something entirely more complex.
Yes, but this works if your tool understands fonts. Maybe they also change paragraph spacing ever so slightly, AND they change some letters to Cyrillic alternatives that look the same, AND they add some 0-width spaces, and and and.
They could change "the" into "a" in different places. This is the kind of stuff done when documents that shouldn't be leaked are handed to politicians.
> They could change "the" into "a" in different places.
Bad bad bad bad, again a bad idea. That changes the structure and the meaning of the content. A single word replacement can change the meaning of an entire sentence, which can change the content of the entire paper. Changing it can create unintended effects which could trash Elsevier's reputation, and universities would move on to a different scientific/academic journal site.
If Elsevier tried this method with peer-reviewed papers, it would have to go through review again to ensure that the original and the revision express the same thing, which is difficult to do. Authors chose those words and that structure to convey their meaning in those papers. They chose them for a reason, and Elsevier is not going to risk its reputation by altering authors' papers and possibly changing their content.
While you can apparently just strip it from the metadata properly as suggested on Twitter, maybe a "low level" approach like comparing the files on a binary level and setting any bytes that differ to 0 would be more robust. It would still work if they move the hash out of the metadata into the document itself. The only downside is that this requires the hash to be of fixed size.
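A literal sketch of that byte-level idea, with hypothetical file names; as the reply below notes, it falls apart once the differing bytes sit inside compressed or encrypted streams (and it assumes both copies are the same length):

    def zero_differences(a: bytes, b: bytes) -> bytes:
        # Keep bytes that match, zero out every byte that differs.
        return bytes(x if x == y else 0 for x, y in zip(a, b))

    copy1 = open("copy_user1.pdf", "rb").read()
    copy2 = open("copy_user2.pdf", "rb").read()
    open("scrubbed.pdf", "wb").write(zero_differences(copy1, copy2))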
If the PDF contains encrypted blocks, this won't work, as simply zeroing out the bits will break the file. Even worse, if the PDF contains compressed chunks (which happens often enough), such a naive approach won't work either - the chunks would have to be uncompressed before comparison.
Such a tool would need a lot of smarts to work reliably enough. At this point, I feel like a metadata stripper that understands the various watermarking methods may be easier to write.
Hm, makes sense. But if I were in their shoes, moving the fingerprint out of the metadata would be the first thing to do, so no easy solution I guess.
I don't recommend normalizing differences, as in using both sources as input to produce one "cleaned" output. It could leave watermarks that happen to be the same in both sources.
I recommend having a tool that works on a single source, then verify that it produces the same output from multiple sources.
Also when downloading multiple times, try to do that from different public IPs and accounts.
At some point the whole state of academic research/papers and publishing should be overhauled. My SO left academic work and started in industry because of the mess that exists in the academic research industry. The funding, and keeping yourself funded as a researcher, is a depressing subject to think about.
Patents (the way they are used today) are often mentioned on HN as something pretty damaging to progress, but these efforts of maximum monetization of the scientific publishing industry are no less of an evil.
If you can, donate to SciHub, and if you're publishing, look for open alternatives to Elsevier's claws.
At least patents expire after a while, and they are public. They mostly slow down commercial application of new discoveries.
Copyright doesn't really expire in some places, or only after a very long time. I believe this is way more damaging to progress. Especially for publication, since science works better in a tight feedback loop (and it doesn't work as well if... authors die before others can reply to their papers).
I've often seen banners stating "Downloaded on such and such date, by this and that university" on papers downloaded from sci-hub. I'm hopeful this hidden metadata won't hurt them either.
This isn't uncommon outside of scihub/libgen either. I use Google Scholar to search for citations of scientific articles and they'll often have links to PDF copies hosted by university servers. I see Penn State University a lot. Polish universities too, but I'm blocked from access as I don't have a Polish IP.
I worked on a service that did this to prevent unauthorised distribution of something we sold in zip files. You can add a lot of identifiable data to pretty much any file format if you try hard enough.
You wouldn't unless you knew they're doing that. And you likely wouldn't even have the option because you'd have to pay multiple licences just to diff them with franc's example.
Elsevier, Axel Springer, et al. - proudly publishing research sponsored by EU and national grants... and hounding anyone who doesn't pay 20 EUR for every PDF! Anyways TIL about mat2 and dangerzone.
You'd need to download from different identities; if I was them I'd be injecting user, IP, organization, date, and a signed hash thereof (tamper evidence if someone does something like change a digit in the IP)
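A minimal sketch of that tamper-evidence idea, using an HMAC as a stand-in for whatever signature scheme a publisher might actually use (all values are made up):

    import hmac, hashlib

    SECRET = b"publisher-side key, never shipped inside the PDF"

    def tag(user: str, ip: str, org: str, date: str) -> str:
        payload = "|".join([user, ip, org, date]).encode()
        return hmac.new(SECRET, payload, hashlib.sha256).hexdigest()

    fields = ("jdoe", "203.0.113.5", "Example University", "2021-01-26")
    print(fields, tag(*fields))   # altering any field invalidates the tag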
The signed hash doesn't matter because you only need to de-identify the document, not pass it off as someone else's. If the organization finds a document with all of the identifying information removed, they know that someone fucked with their DRM but they don't know who.
My thought was that if the publisher is trying to hunt people sharing copies, and they have such a copy, it would be useful to be confident that the metadata you embedded is actually accurate; sure, it's obvious if, say, the IP field is zeroed out, but what if they just changed the last octet to 7, and that results in you spending weeks leaning on an ISP to give you the identity of the wrong person? Granted, that's probably more care than Elsevier is likely to take, but the point is that they're passing data through hostile hands, so it'd be sensible to do something for integrity checking.
Applying SHA256 to 2 different copies of a PDF and receiving the same hash is deterministic proof that uniquely identifying steganographic techniques have not been used.
That doesn't account for any overlaps in tracking data for groups of users.
Instead of a single per-user unique value, I could use several values that track different groups of users. The set of values together would uniquely identify a user, but for any 2 PDFs there would be at least one shared group value that would exist in both.
Using your method, leaking a single PDF would identify a group containing the 2 users of the PDFs you compared.
If the groups are randomized for each new article, every PDF you leak would further identify you as the common member of the leaking groups.
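A toy sketch of that intersection effect (group memberships are made up): each leak only narrows things down to a group, but the intersection across leaks converges on one account.

    leak1 = {"alice", "bob", "carol", "dave"}    # users whose copies share leak 1's marks
    leak2 = {"alice", "erin", "frank", "carol"}
    leak3 = {"alice", "grace", "heidi", "bob"}

    print(leak1 & leak2 & leak3)                 # {'alice'}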
This opens up the opportunity for some kind of distributed file submission tool where you can compare hashes of segments of your document with everyone else's documents in some kind of zero-knowledge way, so that no actual piracy happens until enough people submit their document information for the system to create a de-DRMed copy of the document.
This is true, but you have to realize there is a built-in tradeoff regarding specificity. The more "resilient" this approach is to being found out by a hash, the less specific the identification will be.
Many years ago I had a discussion with a friend about something similar, in the sense of tracking PDF downloads. He wanted to sell educational material in PDFs and he wanted to be able to track pirated copies and who shared his paid PDF and that sort of thing. My proposal was something similar, but using steganography inside the PDF content instead of metadata that can be (relatively easily) stripped. Each time someone bought the PDF, some trivial data (email, date, IP...) would be embedded in the first and last page of the PDF (title page & blank page) at specific coordinates with a very small font size and an almost-white color (same as the page), so that someone could read them later if needed. Yeah, a silly solution, and perhaps that's why it never moved further than the proposal phase.
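If one were to prototype that overlay today, a minimal sketch with reportlab might look like this (the library choice, coordinates, and buyer data are all assumptions; merging the overlay onto the sold PDF's first/last page is left out):

    from reportlab.pdfgen import canvas

    def make_overlay(path: str, buyer_info: str) -> None:
        c = canvas.Canvas(path)
        c.setFont("Helvetica", 2)               # tiny font size
        c.setFillColorRGB(0.99, 0.99, 0.99)     # almost the same white as the page
        c.drawString(36, 36, buyer_info)        # fixed coordinates near the corner
        c.showPage()
        c.save()

    make_overlay("overlay.pdf", "buyer@example.com | 2012-05-01 | 198.51.100.7")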
Hope I'm not giving any ideas to Elsevier and all the other greedy publishers with this ;-)
If a watermarked PDF ended up on the internet, it does not necessarily mean that the person who purchased said PDF leaked it themselves or did anything wrong for that matter. Computers are hacked and stolen all the time. At universities, machines in the lab are oftentimes shared, and document dumps exist on shared partitions. In a court of law (at least in the USA) the burden of proof is on the plaintiff. It could be expensive and difficult to prove that the PDF purchaser did upload it themselves and broke the law. Similar to the music and movie pirates - going mostly after big fish.
I see mention of removal tools. Is this something that could be baked into a browser so it happens automatically during the download process? Or some other way to make it automatic, as it's more likely to happen.
Also, along the same lines, could the original be "scrubbed" for storage so there's no "paper trail" of you having received it?
A hash in the metadata might be news, but such behavior is not unique to Elsevier. Years ago, journals based on the HighWire platform added the date, the institution, or maybe the IP address to the side margin of PDFs.
A motivated publisher could embed codes by subtly altering the distances or color differences between adjacent characters, so that they would survive most color or grey-scale conversions; a seemingly innocuous frame drawn around a photo could be either larger or smaller by, say, one millimeter, representing de facto a bit, so with enough pages they could identify a book among billions.
Unfortunately there's no way to be 100% sure that a complex document doesn't contain some form of embedded code.
You could try to break this by adding some random noise or jitter, slightly transforming the proportions of the pages, or shifting colors in a stochastic way; that would probably complicate their efforts. The frame around the photo will no longer be exactly 1.1231 mm and will throw off their embedded code reading systems. The colors won't be the same hex codes they are expecting and won't be shifted evenly. Spacing is now all off between the characters.
Good information hiding and watermarking doesn't get affected by common transformations. Most changes will be relative to other content, so noise and resizing shouldn't impact it, especially if there's redundancy in the fingerprint codes. It's not "frame is 1.1231 mm == 1", but rather "frame is slightly wider than average of other pages == push 1 into FEC".
How would it be able to get past stochastic transformations? "Frame is slightly wider than average of other pages == push 1 into FEC" could be stymied by making pages randomly wider or narrower, so now the average is different and the frame you are expecting to be slightly wider may even be slightly narrower than the average, garbling your encoding.
Sure, but you're taking two things for granted: you know this is the approach used, and it's the only approach used. If we assume those, you can work around any watermark.
It's probably not worth the cost. Metadata can be trivially stripped, and altering the encoded video and audio streams for each user for fingerprinting is costly.
1. Download the content with N accounts, preferably from different networks.
2. Run your watermark removal tool on each downloaded data independently.
3. Check if the processed outputs are bit-for-bit identical.
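A minimal sketch of the check in step 3, with hypothetical file paths:

    import glob
    import hashlib

    digests = set()
    for path in glob.glob("processed/account_*.pdf"):
        with open(path, "rb") as f:
            digests.add(hashlib.sha256(f.read()).hexdigest())

    print("clean: all copies identical" if len(digests) == 1
          else f"watermark residue likely: {len(digests)} distinct outputs")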
Have fun writing watermark removal tools.