To spell it out further: if the two processed outputs from different accounts are bit-by-bit identical then it is impossible to tell them apart. I.e., they cannot contain data that identifies you personally.
They might still contain data that groups those multiple accounts together, though. E.g. everyone at X university gets the same watermark.
I guess I should have expanded: a watermark is useful to narrow down who could have released the file. Over time, as more information is gained, you get closer to a singleton (the result of the intersection of sets). Whether that takes 1 or 10 watermarks does not make a real difference over time, especially when specialized websites release thousands of such PDFs.
So if you have a watermark that associates files with the time they were downloaded, and you download the file from multiple networks etc. at the same time and remove the diff between these versions, you'll still have the time-based watermark in the resulting file, so you'll leak when you downloaded it.
Problem is you only have to fail once to leak information, and there are only a finite number of bits of information you can leak before your identity is known.
There is still the metadata the operating system stores with the file, even when the file is 1:1 identical. When you use Apple AirDrop to send a file to a friend, the OS remembers who sent the file to your friend. Now if the police ever look at your friend's computer they can see who this file came from. Or if your friend saves it onto a USB drive formatted with the Apple File System, the metadata is still preserved.
Then your tool is incomplete. Steganography works by encoding information into side channels, like precise positioning of words in the PDF, fonts, etc...
Your tool can be as crude as running the document through pdftotext and only keeping the text output. It can just throw away all these possible side channels that are not relevant to the actual content.
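A minimal sketch of that crude approach, assuming the poppler-utils pdftotext binary is installed (the file name is made up):

    # Keep only the extracted text; layout, fonts, kerning tricks and metadata
    # streams are all thrown away by the text extraction step.
    import subprocess
    import sys

    def extract_plain_text(pdf_path: str) -> str:
        # "-" makes pdftotext write the extracted text to stdout
        result = subprocess.run(["pdftotext", pdf_path, "-"],
                                capture_output=True, text=True, check=True)
        return result.stdout

    if __name__ == "__main__":
        print(extract_plain_text(sys.argv[1]))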
Others mentioned other, more sophisticated side channels, such as rewording sentences, but I'm sure authors would not welcome that.
Even if I don't like the practice of charging so much for scientific papers, I must say that watermarks are the best form of DRM. You own the file, you can print it, you can save it. That's a stark difference from most schemes going around these days.
Calibre with the DeDRM plugin can remove the encryption and DRM. Did this a few years ago for my undergrad, it works well. Just import the PDF into Calibre and Calibre will handle it from there (as long as you have the DeDRM plugin).
It depends on the encryption being used. If it comes from Adobe Digital Editions, then the DeDRM plugin needs your credentials to decrypt it and convert the file to an unencrypted one. It needs your credentials since ADE ties the encryption to them, so the plugin needs them to get the key to decrypt the file.
The first setup takes a while, but once it is properly set up, it takes seconds to decrypt encrypted/DRMed PDFs. You can even print the file without worrying that it will be a garbled mess.
Watermarks don't even have to be visible. They could do things that survive re-encoding but are too subtle to see. For example adding a few slightly off color pixels to random letters on the page.
He means you control the file, and have the ability to copy it and open it on any device, and the legal "owner" of the file doesn't have the power to stop you by any common means.
The gold standard for academic watermark removal used to be (and maybe still is?) https://github.com/kanzure/pdfparanoia . Not sure if it edits Elsevier's metadata though.
This is a very basic tool. It may work against this particular scheme by Elsevier if it's not too sophisticated. A few years ago I was working for another content provider on marking their PDFs; my job was to make sure the watermarks were everywhere and that removing them was difficult while remaining practically transparent to the user. We also employed steganography etc. so that rendering the PDFs as images and combining them again into a PDF would preserve the watermark. We were very clear with users that the watermarks were there and, as far as I can tell, not a single copy protected in this way was uploaded to the public Internet.
Here's the story. A mid-sized publisher contacted me about a problem they had: someone was distributing their PDFs online using various services, but mainly Scribd. At that time (it was maybe 2012?) it worked like this: the attacker released a document, notified his group, and they started downloading. The publisher immediately filed a DMCA copyright infringement notification and it was handled within 3-4 days - more than enough time for people to download the "release". It went on for a long time and Scribd refused to remove the attacker's account.[0] After a couple of months they finally gave in and deleted it, but of course a new account was immediately created. It was not so much a cat-and-mouse game as a mouse-and-turtle one.
So when they asked for my help I told them: you don't have a copyright problem, you have an SEO problem - people looking for their publications could easily find them online, with the stolen copies sometimes appearing higher on the SERP than their own pages. They said they would address that, and they somehow came to terms with the fact that the past documents were lost, but they wanted to avoid this situation in the future.
I checked some of their publications. They were moderately priced and the content was quite interesting. Many of their customers were very supportive - actually they were sending reports when they noticed the pirated copies. And they weren't saying "Why should I pay for your books when I can get them online?" but rather "Please take care of protecting your content since we want you to survive and publish more books". So, to answer your question, I was very happy to work for them, and I would do it again.
[0] I was told they changed their approach later and started to collaborate with publishers.
As a customer I want a thriving, commercially viable, decentralised publishing industry. I don't want DRM encumbered formats, or a single vendor, or a point of failure that prevents me from reading books I've purchased.
If my email address embedded in the PDF enables that then I think that's reasonably balanced ethics arithmetic.
(edit - I consider public-funded scholarly research to be a different matter to private purchases of commercial books such as fiction or trade textbooks)
I would just render each page of such a PDF in two colors (black or white without shades) and turn the images into a new PDF to upload. If that doesn’t make the steganography fail, maybe add a lot of static? Or censor illustrations anyway since many journals provide them independently of the text.
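A minimal sketch of that re-rasterization idea, assuming the pdf2image and Pillow packages (plus poppler) are available; file names are hypothetical:

    # Render every page to a 1-bit black-and-white bitmap and reassemble the
    # bitmaps into a fresh PDF, discarding the original page description.
    from pdf2image import convert_from_path

    def rasterize_to_bilevel_pdf(src: str, dst: str, dpi: int = 300) -> None:
        pages = convert_from_path(src, dpi=dpi)      # render each page to an image
        bilevel = [p.convert("1") for p in pages]    # black or white, no shades
        bilevel[0].save(dst, save_all=True, append_images=bilevel[1:])

    rasterize_to_bilevel_pdf("watermarked.pdf", "rasterized.pdf")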
"I would just" [do this slightly more complicated transformation] really misses the point. Nobody disputes the fact that you could turn any given human-readable pdf into an untraceable version. E.g. you could copy it all out longhand and make sketches of all the plots. Certainly it would work. The point is that as the necessary transformation becomes more complicated/more work, the fraction of the population that will actually do it will vanish, and the surveillance scheme will be effective again.
It wouldn't work since we were allowed to manipulate certain aspects of text, too, to a certain extent, in a way that looked pretty much statistically random. That is, by comparing several copies of the same file, the bits were very much different, and the differences were distributed over the whole document. One of the main aims was to make sure the watermark gets preserved over various transformations, and we automatically tested each new document to make sure it works.
You could even just move a few words around, depending on the language.
Depending on the language, you could even just move a few words around.
Do that randomly, combine the products, and you might get enough entropy to create unique fingerprints for each download.
Randomly do that, combine the products, and you might get enough entropy to create unique fingerprints for each download.
(This silly example can create 4 unique fingerprints)
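As a toy sketch of why that gives 4 fingerprints: two sentences with two equivalent orderings each yield 2 x 2 combinations.

    # Enumerate every combination of the two reworded sentences above.
    from itertools import product

    variants = [
        ("You could even just move a few words around, depending on the language.",
         "Depending on the language, you could even just move a few words around."),
        ("Do that randomly, combine the products, and you might get enough entropy.",
         "Randomly do that, combine the products, and you might get enough entropy."),
    ]

    for n, combo in enumerate(product(*variants)):
        print(f"fingerprint {n:02b}:", " ".join(combo))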
When I write, I put a great deal of thought into how to arrange sentences for maximum clarity or effectiveness. I would not appreciate an eBook service messing with that, even if the meaning was unchanged.
In the most extreme case, imagine if this was a book of poetry.
For PDF you can do this in a much more subtle way. In a typical block of text every individual letter comes with its own kerning adjustment. You can adjust those in a way that's invisible to the reader but still allows fingerprinting. There are probably 1000 different options too - don't think of moving words as in swapping positions in a sentence. (I know the parent suggested it, but that's silly)
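A conceptual sketch of the kerning idea (not the scheme described above, and not tied to any real PDF library): hide one bit in the parity of each per-glyph adjustment, where a change of one unit - a thousandth of the text size - is invisible.

    def embed_bits_in_kerning(kerning: list[int], bits: str) -> list[int]:
        # Hypothetical encoding: even adjustment encodes 0, odd encodes 1;
        # each value moves by at most 1/1000 of the text size.
        out = list(kerning)
        for i, bit in enumerate(bits):
            if i >= len(out):
                break
            out[i] = out[i] - (out[i] % 2) + int(bit)
        return out

    # 8 kerning slots carry one byte; a full page offers far more.
    print(embed_bits_in_kerning([-12, 0, 4, -3, 7, 0, -1, 2], "10110010"))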
Replacing characters with identical-looking Unicode chars, adding extra spaces here and there, adding newlines (and more spaces :)), adding random typos, using a dictionary of "safe" word/phrase replacements, etc. And don't forget about formulas, charts, etc. - a pure text version is not too useful on its own.
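A minimal sketch of two of those tricks, homoglyph swaps and zero-width spaces, each carrying one bit per opportunity (the mapping and payload are illustrative only):

    HOMOGLYPHS = {"a": "\u0430", "e": "\u0435", "o": "\u043e"}  # Cyrillic look-alikes
    ZWSP = "\u200b"  # zero-width space

    def embed(text: str, bits: str) -> str:
        out, i = [], 0
        for ch in text:
            if ch in HOMOGLYPHS and i < len(bits):
                out.append(HOMOGLYPHS[ch] if bits[i] == "1" else ch)
                i += 1
            elif ch == " " and i < len(bits):
                out.append(" " + ZWSP if bits[i] == "1" else " ")
                i += 1
            else:
                out.append(ch)
        return "".join(out)

    marked = embed("the quick brown fox jumps over the lazy dog", "1011001")
    print(marked.encode("unicode_escape"))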
If you deal with fiction and the like where you basically have just text then I think that's correct: it would be trivial to detect the watermarks in various copies by simply comparing them. I was dealing with PDFs containing tables, formulas, illustrations, etc., so a plain-text version would be unusable.
Randomly choose 3 big paragraphs in the ebook and add an extra newline in the middle of each, at the end of a random sentence. This would be my choice if I had to do some kind of invisible watermarking, at least.
Closer to home and a bit more extreme, a few transposed numbers in a scholarly article would be enough to rekindle another autism/vaccine conspiracy theory!
No, this would not work for a couple of reasons. Manipulating the content itself such as changing the order of words is very dangerous as it can influence the meaning, and if you process things at scale it could lead to devastating consequences. But there are many other aspects of text such as kerning and others (a dozen or so in this particular case) that are virtually invisible to the reader but are detectable by a machine. I'd prefer not to get into the details of the implementation here but of course a dedicated team with enough resources could successfully break it after some time - but I believe it wouldn't make any sense economically.
I'm curious what a legal case would look like around steganography. "Trust that our system says that this string is embedded". Or would the prosecution be obliged to divulge the algorithm?
I believe it would be enough to demonstrate the functioning of the system in action - there was a crude UI that you could use to extract the string from the input document, so it would be easy to demonstrate that the PDFs uploaded by the attacker to a given service do in fact contain those strings.
But in this particular case it wasn't even necessary. After a couple of months it turned out that the person who had been uploading the unprotected versions made a mistake and was located as a 20-something living with his parents in a small house on the East Coast. It was enough to notify them and the malicious activity stopped. The company wasn't interested in extracting every penny from the kid (or his poor parents), they just wanted him to stop, and one letter from a lawyer was enough. If they had wanted to go full steam, they would have involved the police and I'm sure they would have found quite a lot of incriminating evidence on his computer, but they were clear that ruining someone else's life was not their aim.
Over here (Poland) almost everyone does something of that sort.
Our ebook market is so fragmented that there's no DRM solution that all (or even most) e-readers work with. If you add in smartphones and car stereos (for audiobooks), the situation gets even worse. Therefore, most publishers use watermarking instead of DRM, usually giving you Epub, PDF and Mobi, which you can read on any device you want.
The most common form of watermarking, at least where epub is concerned, is a 1px by 1px div containing a nonsensical hex or base64-encoded string. There are rumors of watermarks in cover images and even the text itself, though. Apparently, sometimes spaces get removed or extraneous spaces are added, lines get split up a little differently, or some common spelling or typing mistakes are made. Considering the huge number of alteration points you have, let's say 10000 per book, even doing two alterations lets you uniquely watermark on the order of 10000 choose 2, i.e. roughly 50 million copies.
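A quick back-of-the-envelope check of that capacity claim (10000 is just the illustrative number of alteration points from above):

    import math

    points = 10_000
    print(math.comb(points, 2))   # ~5.0e7 distinguishable copies with 2 alterations
    print(math.comb(points, 3))   # ~1.7e11 with 3 alterations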
Similar things are done to audiobooks, whether by modifying the audio itself in imperceptible ways, or by modifying the internal structure of the mp3 files. From what I've heard, messing with how frames are laid out and what's in-between them is a common tactic.
What do they do with them? Are there criminal or civil cases with those watermarks? In the US courts I believe those secret watermarks would be made public at trial, so what’s the point?
It starts with a letter that basically says "we know what you're doing, we're keeping an eye on you, please stop or you'll end up in jail". I don't know what happens afterwards; I know one person who got such a letter and stopped their illicit activities.
If you were serious about piracy and wanted to release books en masse, you'd probably use stolen credit cards, stolen accounts or something of that sort. I don't think that's the goal here, though, those watermarks are mostly for deterring casual piracy, sharing books with friends and so on.
Elsevier could easily make this completely non-trivial. For a very silly example, imagine how this would work if each user got a PDF using a different font. How do you automatically normalize the difference between Comic Sans and Times New Roman? Sure, it's trivial to write a tool that understands what "fonts" are and does this (especially for PDF), but you can't do it with a simple binary tool.
And of course Elsevier can do something entirely more complex.
Yes, but this works if your tool understands fonts. Maybe they also change paragraph spacing ever so slightly, AND they change some letters to Cyrillic alternatives that look the same, AND they add some 0-width spaces, and and and.
They could change "the" into "a" in different places. This is the kind of stuff done when documents that shouldn't be leaked are handed to politicians.
> They could change "the" into "a" in different places.
Bad bad bad bad, again a bad idea. That changes the structure and the meaning of the content. A single word replacement can change the meaning of an entire sentence, which can change the content of the entire paper. Changing it can create unintended effects which could trash Elsevier's reputation, and universities would move on to a different scientific/academic journal site.
If Elsevier tried this method with peer-reviewed papers, it would have to go through review again to ensure that the original and the revision express the same thing, which is difficult to do. Authors chose those words and that structure to convey their meaning in those papers. They chose them for a reason, and Elsevier is not going to risk its reputation by altering authors' papers and possibly changing their content.
While you can apparently just strip it from the metadata properly as suggested on Twitter, maybe a "low level" approach like comparing the files on a binary level and setting any bytes that differ to 0 would be more robust. It would still work if they move the hash out of the metadata into the document itself. The only downside is that this requires the hash to be of fixed size.
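A literal sketch of that byte-level idea, with hypothetical file names; as the reply below notes, it falls apart once the differing bytes sit inside compressed or encrypted streams (and it assumes both copies are the same length):

    def zero_differences(a: bytes, b: bytes) -> bytes:
        # Keep bytes that match, zero out every byte that differs.
        return bytes(x if x == y else 0 for x, y in zip(a, b))

    copy1 = open("copy_user1.pdf", "rb").read()
    copy2 = open("copy_user2.pdf", "rb").read()
    open("scrubbed.pdf", "wb").write(zero_differences(copy1, copy2))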
If the PDF contains encrypted blocks, this won't work, as simply zeroing out the bits will break the file. Even worse, if the PDF contains compressed chunks (which happens often enough), such a naive approach won't work either - the chunks would have to be uncompressed before comparison.
Such a tool would need a lot of smarts to work reliably enough. At this point, I feel like a metadata stripper that understands the various watermarking methods may be easier to write.
Hm, makes sense. But if I were in their shoes, moving the fingerprint out of the metadata would be the first thing to do, so no easy solution I guess.
I don't recommend normalizing differences, as in using both sources as input to produce one "cleaned" output. It could leave watermarks that happen to be the same in both sources.
I recommend having a tool that works on a single source, then verify that it produces the same output from multiple sources.
Also when downloading multiple times, try to do that from different public IPs and accounts.
At some point the whole state of academic research/papers and publishing should be overhauled. My SO left academic work and started in industry because of the mess that exists in the academic research industry. The funding, and keeping yourself funded as a researcher, is a depressing subject to think about.
Patents (the way they are used today) are often mentioned on HN as something pretty damaging to progress, but these efforts of maximum monetization of the scientific publishing industry are no less of an evil.
If you can, donate to SciHub, and if you're publishing, look for open alternatives to Elsevier's claws.
At least patents expire after a while, and they are public. They mostly slow down commercial application of new discoveries.
Copyright doesn't really expire in some places, or only after a very long time. I believe this is way more damaging to progress. Especially for publication, since science works better in a tight feedback loop (and it doesn't work as well if... authors die before others can reply to their papers).
I've often seen banners stating "Downloaded on such and such date, by this and that university" on papers downloaded from sci-hub. I'm hopeful this hidden metadata won't hurt them either.
This isn't uncommon outside of scihub/libgen either. I use Google Scholar to search for citations of scientific articles and they'll often have links to PDF copies hosted by university servers. I see Penn State University a lot. Polish universities too, but I'm blocked from access as I don't have a Polish IP.
I worked on a service that did this to prevent unauthorised distribution of something we sold in zip files. You can add a lot of identifiable data to pretty much any file format if you try hard enough.
You wouldn't unless you knew they're doing that. And you likely wouldn't even have the option because you'd have to pay multiple licences just to diff them with franc's example.
Elsevier, Axel Springer, et al. - proudly publishing research sponsored by EU and national grants... and hounding anyone who doesn't pay 20 EUR for every PDF! Anyways TIL about mat2 and dangerzone.
You'd need to download from different identities; if I was them I'd be injecting user, IP, organization, date, and a signed hash thereof (tamper evidence if someone does something like change a digit in the IP)
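A minimal sketch of that tamper-evidence idea, using an HMAC as a stand-in for whatever signature scheme a publisher might actually use (all values are made up):

    import hmac, hashlib

    SECRET = b"publisher-side key, never shipped inside the PDF"

    def tag(user: str, ip: str, org: str, date: str) -> str:
        payload = "|".join([user, ip, org, date]).encode()
        return hmac.new(SECRET, payload, hashlib.sha256).hexdigest()

    fields = ("jdoe", "203.0.113.5", "Example University", "2021-01-26")
    print(fields, tag(*fields))   # altering any field invalidates the tag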
The signed hash doesn't matter because you only need to de-identify the document, not pass it off as someone else's. If the organization finds a document with all of the identifying information removed, they know that someone fucked with their DRM but they don't know who.
My thought was that if the publisher is trying to hunt people sharing copies, and they have such a copy, it would be useful to be confident that the metadata you embedded is actually accurate; sure, it's obvious if, say, the IP field is zeroed out, but what if they just changed the last octet to 7, and that results in you spending weeks leaning on an ISP to give you the identity of the wrong person? Granted, that's probably more care than Elsevier is likely to take, but the point is that they're passing data through hostile hands, so it'd be sensible to do something for integrity checking.
Applying SHA256 to 2 different copies of a PDF and receiving the same hash is deterministic proof that uniquely identifying steganographic techniques have not been used.
That doesn't account for any overlaps in tracking data for groups of users.
Instead of a single per-user unique value, I could use several values that track different groups of users. The set of values together would uniquely identify a user, but for any 2 PDFs there would be at least one shared group value that would exist in both.
Using your method, leaking a single PDF would identify a group containing the 2 users of the PDFs you compared.
If the groups are randomized for each new article, every PDF you leak would further identify you as the common member of the leaking groups.
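A toy sketch of that intersection effect (group memberships are made up): each leak only narrows things down to a group, but the intersection across leaks converges on one account.

    leak1 = {"alice", "bob", "carol", "dave"}    # users whose copies share leak 1's marks
    leak2 = {"alice", "erin", "frank", "carol"}
    leak3 = {"alice", "grace", "heidi", "bob"}

    print(leak1 & leak2 & leak3)                 # {'alice'}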
This opens up the opportunity for some kind of distributed file submission tool where you can compare hashes of segments of your document with everyone else's documents in some kind of zero-knowledge way, so that no actual piracy happens until enough people submit their document information for the system to create a de-DRMed copy of the document.
This is true, but you have to realize there is a built-in tradeoff regarding specificity. The more "resilient" this approach is to being found out by a hash, the less specific the identification will be.
Many years ago I had a discussion with a friend about something similar, in the sense of tracking PDF downloads. He wanted to sell educational material in PDFs and he wanted to be able to track pirated copies and who shared his paid PDF and that sort of thing. My proposal was something similar, but using steganography inside the PDF content instead of metadata that can be (relatively easily) stripped. Each time someone bought the PDF, some trivial data (email, date, IP...) would be embedded in the first and last page of the PDF (title page & blank page) at specific coordinates with a very small font size and an almost-white color (same as the page), so that someone could read them later if needed. Yeah, a silly solution, and perhaps that's why it never moved further than the proposal phase.
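If one were to prototype that overlay today, a minimal sketch with reportlab might look like this (the library choice, coordinates, and buyer data are all assumptions; merging the overlay onto the sold PDF's first/last page is left out):

    from reportlab.pdfgen import canvas

    def make_overlay(path: str, buyer_info: str) -> None:
        c = canvas.Canvas(path)
        c.setFont("Helvetica", 2)               # tiny font size
        c.setFillColorRGB(0.99, 0.99, 0.99)     # almost the same white as the page
        c.drawString(36, 36, buyer_info)        # fixed coordinates near the corner
        c.showPage()
        c.save()

    make_overlay("overlay.pdf", "buyer@example.com | 2012-05-01 | 198.51.100.7")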
Hope I'm not giving any ideas to Elsevier and all the other greedy publishers with this ;-)
If a watermarked PDF ended up on the internet, it does not necessarily mean that the person who purchased said PDF leaked it themselves or did anything wrong for that matter. Computers are hacked and stolen all the time. At universities, machines in the lab are oftentimes shared, and document dumps exist on shared partitions. In a court of law (at least in the USA) the burden of proof is on the plaintiff. It could be expensive and difficult to prove that the PDF purchaser did upload it themselves and broke the law. Similar to the music and movie pirates - going mostly after big fish.
I see mention of removal tools. Is this something that could be baked into a browser so it happens automatically during the download process? Or some other way to make it automatic, as it's more likely to happen.
Also, along the same lines, could the original be "scrubbed" for storage so there's no "paper trail" of you having received it?
A hash in the metadata might be news, but such behavior is not unique to Elsevier. Years ago, journals based on the HighWire platform added the date, the institution, or maybe the IP address to the side margin of PDFs.
A motivated publisher could embed codes by subtly altering the distances or color differences between adjacent characters, so that they would survive most color or grey-scale conversions; a seemingly innocuous frame drawn around a photo could be either larger or smaller by, say, one millimeter, representing de facto a bit, so with enough pages they could identify a book among billions.
Unfortunately there's no way to be 100% sure that a complex document doesn't contain some form of embedded code.
You could try to break this by adding some random noise or jitter, slightly transforming the proportions of the pages, or shifting colors in a stochastic way; that would probably complicate their efforts. The frame around the photo will no longer be exactly 1.1231 mm and will throw off their embedded code reading systems. The colors won't be the same hex codes they are expecting and won't be shifted evenly. Spacing is now all off between the characters.
Good information hiding and watermarking doesn't get affected by common transformations. Most changes will be relative to other content, so noise and resizing shouldn't impact it, especially if there's redundancy in the fingerprint codes. It's not "frame is 1.1231 mm == 1", but rather "frame is slightly wider than average of other pages == push 1 into FEC".
How would it be able to get past stochastic transformations? "Frame is slightly wider than average of other pages == push 1 into FEC" could be stymied by making pages randomly wider or narrower, so now the average is different and the frame you are expecting to be slightly wider may even be slightly narrower than the average, garbling your encoding.
Sure, but you're taking two things for granted: you know this is the approach used, and it's the only approach used. If we assume those, you can work around any watermark.
It's probably not worth the cost. Metadata can be trivially stripped, and altering the encoded video and audio streams for each user for fingerprinting is costly.
1. Download the content with N accounts, preferably from different networks.
2. Run your watermark removal tool on each downloaded data independently.
3. Check if the processed outputs are bit-for-bit identical.
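A minimal sketch of the check in step 3, with hypothetical file paths:

    import glob
    import hashlib

    digests = set()
    for path in glob.glob("processed/account_*.pdf"):
        with open(path, "rb") as f:
            digests.add(hashlib.sha256(f.read()).hexdigest())

    print("clean: all copies identical" if len(digests) == 1
          else f"watermark residue likely: {len(digests)} distinct outputs")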
Have fun writing watermark removal tools.