That this is the state of things is pretty amazing.

ur-whale · on Jan 29, 2021

PDF is a shitty format.

As proof, I offer: can't grep on it.

Which, BTW, if it was possible, would have prevented this mess.

statquontrarian · on Jan 29, 2021

Linux has a pdfgrep command which works pretty well (unless the PDF has images of text).

ur-whale · on Jan 29, 2021

Unless the PDF has every letter placed by independent PostScript-like commands or, worse, the text has been imported from an SVG file.

pdfgrep is at best probabilistic, and I stand by my comment.
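
To make the per-letter case concrete, here is a rough sketch. The two content-stream fragments are hand-written for illustration (hypothetical, not from a real file), and the helper names `naive_search` and `glue_strings` are made up. Real streams are usually compressed and may use custom font encodings, so treat this as a toy, not as how pdfgrep works.

```python
import re

# Hypothetical, hand-written content-stream fragments that both draw "grep".
# Tj shows a string; Td / Tm set the text position.
WHOLE_WORD = b"BT /F1 12 Tf 72 720 Td (grep) Tj ET"
PER_GLYPH = (
    b"BT /F1 12 Tf "
    b"1 0 0 1 72 720 Tm (g) Tj "
    b"1 0 0 1 79 720 Tm (r) Tj "
    b"1 0 0 1 83 720 Tm (e) Tj "
    b"1 0 0 1 90 720 Tm (p) Tj ET"
)


def naive_search(stream: bytes, needle: bytes) -> bool:
    """Plain substring search over the raw stream bytes."""
    return needle in stream


def glue_strings(stream: bytes) -> bytes:
    """Concatenate every literal string shown with Tj, in stream order.

    Positions are ignored, so the result is only a guess at reading order.
    """
    return b"".join(re.findall(rb"\((.*?)\)\s*Tj", stream))


if __name__ == "__main__":
    print(naive_search(WHOLE_WORD, b"grep"))
    print(naive_search(PER_GLYPH, b"grep"))
    print(glue_strings(PER_GLYPH))
```

The substring search only works when the word happens to be emitted as one string. Gluing the pieces back together works here because the glyphs are emitted in reading order, which nothing in the format guarantees.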

flal_ · on Jan 30, 2021

Agreed. I work on a solution for extracting data from random documents (invoices, payslips, you name it). For native PDFs in the wild (not scans), we gave up and just rasterize them and send them to OCR software. It's also probabilistic, but way more reliable.

addandsubtract · on Jan 30, 2021

Spotlight can also search within PDFs.

Aerroon · on Jan 29, 2021

And yet it is the format documents are in. Government websites are full of PDFs.