> One of the first things you might notice is that it has a curious bit of coloring in it. What gives? Shouldn’t it just be black and white since the text is black? Are they trolling us with colored letters?
> I’m actually not 100% sure why this happens
It's called sub-pixel rendering or sub-pixel anti-aliasing[1]. Most LCD displays have RGB[2][3] subpixel layouts. Sub-pixel AA takes advantage of this to increase the horizontal resolution of a raster framebuffer by a factor of 3, at the cost of this colour ringing on high-contrast edges such as text—especially on lower-resolution displays. Windows' implementation is called ClearType[4].
It is increasingly less useful as pixel densities increase.
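To make the colour fringing concrete, here is a minimal sketch of the idea, not how ClearType or any real rasteriser does it. It assumes an RGB stripe layout and black text on white, and the function name `subpixel_downsample` is made up for illustration. It samples glyph coverage at 3x horizontal resolution and lets each sample drive one subpixel:

```python
def subpixel_downsample(row):
    """Pack coverage samples taken at 3x horizontal resolution into RGB pixels.

    row: coverage values in [0.0, 1.0], length divisible by 3,
         where 1.0 means fully covered by black text.
    Returns (r, g, b) tuples in 0-255 for black text on a white background.
    Assumes an RGB stripe layout (an assumption, not true of every panel).
    """
    if len(row) % 3 != 0:
        raise ValueError("row length must be a multiple of 3")
    pixels = []
    for i in range(0, len(row), 3):
        # Each sample drives one subpixel; full coverage turns that subpixel off.
        r, g, b = (round(255 * (1.0 - c)) for c in row[i:i + 3])
        pixels.append((r, g, b))
    return pixels


# A glyph edge that starts one third of the way into the first pixel:
# green and blue are covered, red stays lit, so the edge pixel comes out red.
edge = [0.0, 1.0, 1.0, 1.0, 1.0, 1.0]
print(subpixel_downsample(edge))  # [(255, 0, 0), (0, 0, 0)]
```

Real implementations filter across neighbouring subpixels to soften that fringe, which this sketch deliberately skips.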
The best way to redact a document in practice is not to go over it with blur or black boxes, but to completely delete the text you want to redact, and replace it with a placeholder such as REDACTED. This means you won't leak length or anything else. Of course, it can change formatting and page numbers, so it's worth being mindful of that.
Just make sure you don't use a file format which supports history (such as PDF).
If you have to redact with black boxes, the safest way is to redact-rasterise-OCR. That obviously breaks some stuff though and degrades your document quality.
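For plain text, the delete-and-replace approach can be sketched like this. It is only an illustration: `redact_terms` and the sample sentence are made up, and it does nothing about history or metadata in the file format you save to.

```python
import re

PLACEHOLDER = "REDACTED"


def redact_terms(text, terms):
    """Replace every occurrence of each sensitive term with a fixed placeholder."""
    # Longest first, so a term that contains another term is replaced whole.
    ordered = sorted(set(terms), key=len, reverse=True)
    if not ordered:
        return text
    pattern = re.compile("|".join(re.escape(t) for t in ordered), re.IGNORECASE)
    return pattern.sub(PLACEHOLDER, text)


doc = "Alice Smith met Bob at 12 Example Road."
print(redact_terms(doc, ["Alice Smith", "12 Example Road", "Bob"]))
# REDACTED met REDACTED at REDACTED.
```

Because every term collapses to the same placeholder, the output says nothing about how long each original string was.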
> Of course, it can change formatting and page numbers, so it's worth being mindful of that.
The way this is phrased may give the impression that this is not much of a deal breaker in practical application as long as you just keep it in mind, but it really is.
On the other hand, leaving the length of the original text unchanged can leak information. Imagine a person's full name is redacted but it's only 7 characters long in total. That massively cuts down on the possibilities and may be enough to unmask their identity if there are only a few thousand names that might plausibly be listed.
It really depends on the context. If you’re talking about evidence in a court case where things are referred to by page number that’s one thing, but if you’re a public authority releasing some internal manual, there’s no need to keep the formatting identical, especially if you’re omitting entire sections.
I doubt many people are doing this. Often the pixelation is visibly terrible. In fact, if you look at the pixelation challenge in this post, you can almost read off the answer with just a bit of concentration. I see a lot of things like that, sadly.
Cryptographers like to say that just because you've created a cryptography scheme strong enough to defeat yourself doesn't mean you've created a strong cryptography scheme. I feel like redaction is much the same. You really need to go with solid, proven techniques and avoid the ones proven weak.
Fortunately, most text is proportionally spaced, rather than monospaced, so you cannot really determine the number of characters, only the length in pixels/points of the redacted phrase.
You're right. In some cases that's plenty, such as with the sentence "Mr. Jones was found XXXXXXXXXXXXXX of the crime."
If you know it's likely either "guilty" or "not guilty," then the length (regardless of font) will automatically give away which of these two options it is since you can just 'bruteforce' the whole solution set of two options. No in-place redaction will work for a case like that, except possibly a longer one where the whole sentence is redacted and hopefully no one can guess what kind of sentence it was.
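A rough sketch of that brute force, with made-up glyph widths (a real attempt would read the advance widths from the actual font and also account for kerning, which this ignores). `GLYPH_WIDTHS`, `text_width` and `candidates_matching` are illustrative names, not anything from the article:

```python
# Hypothetical advance widths in arbitrary units. These numbers are invented
# for the example; a real attack would take them from the document's font.
GLYPH_WIDTHS = {"g": 5, "u": 5, "i": 2, "l": 2, "t": 3, "y": 5, "n": 5, "o": 5, " ": 3}


def text_width(text, widths=GLYPH_WIDTHS):
    """Width of a string as the sum of its glyph advances (no kerning)."""
    return sum(widths[ch] for ch in text)


def candidates_matching(box_width, candidates, tolerance=1):
    """Keep only the candidates whose rendered width fits the redaction box."""
    return [c for c in candidates if abs(text_width(c) - box_width) <= tolerance]


options = ["guilty", "not guilty"]
print({o: text_width(o) for o in options})   # {'guilty': 22, 'not guilty': 38}
print(candidates_matching(38, options))      # ['not guilty']
```

With only two candidates of very different widths, measuring the box is enough, whatever the font.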
Not necessarily. You can white/black out the text and replace it with something that's longer or shorter than the original. It's really just a matter of how much effort the person wants to put into the redaction.
Exactly, the stupid bit is using the original text to drive the pixelization. Another approach would be to just generate random gray pixel values over the redacted text. Simple. Simpler even than the weird assumption that you would use the original text.
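Something like the following would do, as a sketch. The function name `random_mosaic` and the block size are arbitrary choices, and the grid is just a list of rows rather than a real image format:

```python
import secrets


def random_mosaic(width, height, block=8):
    """Return a height x width grid of gray values (0-255) built from random blocks.

    The covered text is never an input, so the pattern cannot encode it.
    """
    cols = -(-width // block)   # ceiling division
    rows = -(-height // block)
    # One random gray level per block, drawn from the OS CSPRNG.
    levels = [[secrets.randbelow(256) for _ in range(cols)] for _ in range(rows)]
    return [[levels[y // block][x // block] for x in range(width)]
            for y in range(height)]


patch = random_mosaic(20, 10, block=8)
print(len(patch), len(patch[0]))  # 10 20
```

The size of the patch still gives away the size of what it covers, so this only fixes the leak through the pixel values.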
Step 1: Draw a solid box, of whatever color you choose, that covers everything you want to censor, because you don't want to risk leaking any information from it in any way.
Step 2: Draw whatever censored-looking stuff you want on that now-empty space because you want the censoring to look fancy.
Step 3: Spend some time thinking if step 2 was really worth it or not.
Best to redact using black rectangles. However, when you redact a PDF, make sure you don't just draw rectangles over the text: the text remains underneath and can be easily extracted. Add the rectangles, then export the pages to raster images such as PNG or JPG, and recombine those into a PDF.
If your redacted documents are to be printed, ensure that you first redact then print. If you must print then redact (let's say you prefer to redact a physical document), photocopy the redacted document and distribute that, so the text cannot be forensically recovered.
Anything else (blurs etc.) is prone to reversal or leakage.
> photocopy the redacted document and distribute that, so the text cannot be forensically recovered.
To be really safe, you will want to check whether the photocopier forgets about your document after copying. Many photocopiers store documents on hard disks before copying them, and there’s no guarantee that the data is deleted afterwards.
“In 2008, Sharp commissioned a survey on copier security that found 60 percent of Americans "don't know" that copiers store images on a hard drive. Sharp tried to warn consumers about the simple act of copying.
"It's falling on deaf ears," McLaughlin said. "Or people don't feel it's important, or 'we'll take care of it later.'"
All the major manufacturers told us they offer security or encryption packages on their products. One product from Sharp automatically erases an image from the hard drive. It costs $500.
But evidence keeps piling up in warehouses that many businesses are unwilling to pay for such protection, and that the average American is completely unaware of the dangers posed by digital copiers.”
(I think selling users that as an add-on is almost criminal. Copiers should make sure documents don’t survive on disk after copying (not even as bits in a now deleted, but not overwritten, file), unless explicitly instructed by the user to do so)
I have a vague memory of someone being able to fairly accurately estimate redacted words and phrases in a government document by using the size of the blacked out portion along with the font metrics. I think the safest way to redact text would be to first normalize it all to the same text (maybe something like "etaoin shrdlu" from the hot type era), then black it all out, then there would be even less information leaked.
This is especially true when the list of potential words is known. So if you know the court case is about Mark, Sandeep, and Elizabeth, and the names are redacted with boxes, then it's trivial to unredact each name by just looking at the length of the boxes.
tbh I would be wary of this too; if you use a run-of-the-mill office copy station that defaults to color copies, you never know what information might leak in the subtle shades of gray.
If you have the paid version of Acrobat, the built-in redact tool is pretty nice. It lets you just highlight the text you want gone, but it will properly remove that text and the image.
I can imagine a situation where a document was first scanned to JPEG and then redacted. Could the information leak into the 8x8 blocks of the JPEG compression?
If you need a fool-proof process, flatten everything to a 'monochrome' (black and white) image and draw over the redacted areas with a total fill.
This loses text search, but any process that has to understand which objects are being redacted is vastly more complex and thus ripe for errors, particularly if complex formats like word processor, spreadsheet, or PDF files are involved.
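As a sketch of that flatten-then-fill step, operating on a plain list of grayscale rows instead of a real image file (the function name, the threshold of 128 and the box format are all assumptions for the example):

```python
def flatten_and_redact(gray, boxes, threshold=128):
    """Threshold a grayscale page to 1-bit, then paint redaction boxes solid black.

    gray:  list of rows of 0-255 values
    boxes: list of (left, top, right, bottom), with right/bottom exclusive
    Returns rows of 0 (black) and 1 (white).
    """
    # Step 1: flatten. After this, no intermediate shades survive anywhere.
    page = [[1 if v >= threshold else 0 for v in row] for row in gray]
    height = len(page)
    width = len(page[0]) if page else 0
    # Step 2: total fill. Boxes are clipped to the page bounds.
    for left, top, right, bottom in boxes:
        for y in range(max(0, top), min(height, bottom)):
            for x in range(max(0, left), min(width, right)):
                page[y][x] = 0
    return page


scan = [[255, 200, 100, 30],
        [250, 10, 240, 130]]
print(flatten_and_redact(scan, [(1, 0, 3, 2)]))
# [[1, 0, 0, 0], [1, 0, 0, 1]]
```

The fill does not need to know what is underneath it, which is the point: there is no object model to get wrong.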
Traditional redaction implies a retention of length unless an entire block is considered protected / classified / secret in some way.
To destroy even the length, or possible steganographic content, a transcription and maybe a rephrasing might be required. However, at that point it's better to go for a summary that avoids the content, knowing for sure that there's a loss of meaning.
I never did that myself so I wonder: what tools are typically used to pixelate text? And do those tools have to do the actual "pixelation"? Why can't you just have a tool in, say, Microsoft Word, that would allow you to select the text, choose a "pixelate" option, and generate a completely random pattern of squares in different shades of grey?
> we’re focusing on one such technique – pixelation – and will show you why it’s a no-good, bad, insecure, surefire way to get your sensitive data leaked.
It isn't; you just use randomly generated text instead of the actual text, so you get the aesthetic benefit (black bars are ugly) without the leak.
(Though there is still the issue of leaking length, which might be important for names. E.g. if there are only 10 generals and only one has a very long name, you can still guess who the redacted person is. You would need to break the formatting of the original to hide that.)
Word length can be extremely revealing. I'd suggest covering the entire document. Plus a large (& random) number of junk pages which you prepend and append to the document. One pixel per page should be fine.
It's still not a good idea. Analysts may attempt to de-cloak the word with varying degrees of success by analyzing the width of the redacted word against font character widths and using techniques from the study of Natural Language Processing to identify likely word candidates that fit in context (e.g., with respect to the rest of the sentence or document).
[1]: https://en.wikipedia.org/wiki/Subpixel_rendering
[2]: https://geometrian.com/programming/reference/subpixelzoo/squ...
[3]: https://geometrian.com/programming/reference/subpixelzoo/squ...
[4]: https://learn.microsoft.com/en-gb/typography/cleartype/