
That this is the state of things is pretty amazing.


PDF is a shitty format.

As proof, I offer: can't grep on it.

Which, BTW, if it was possible, would have prevented this mess.


Linux has a pdfgrep command which works pretty well (unless the PDF has images of text).


Unless the PDF has every letter placed by independent PostScript-like commands or, worse, the text has been imported from an SVG file.

pdfgrep is at best probabilistic and I stand by my comment.
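
To make the "every letter placed independently" problem concrete, here is a toy sketch in Python. It is not how pdfgrep or any real extractor works internally; the `Glyph` record, the `rebuild_lines` helper, and both thresholds are made up for illustration. It only shows why the text has to be guessed back from coordinates, and why that guess depends on tunable tolerances.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    """Hypothetical record for one independently placed letter."""
    char: str
    x: float      # left edge, in page units
    y: float      # baseline (PDF y grows upward)
    width: float  # advance width


def rebuild_lines(glyphs, line_tol=2.0, space_gap=1.5):
    """Heuristically rebuild text lines from glyph positions.

    line_tol and space_gap are guesses; different documents need
    different values, which is where the "probabilistic" part comes in.
    """
    lines = []  # each entry: (baseline_y, [glyphs on that line])
    for g in sorted(glyphs, key=lambda g: -g.y):  # top of page first
        if lines and abs(lines[-1][0] - g.y) <= line_tol:
            lines[-1][1].append(g)
        else:
            lines.append((g.y, [g]))

    out = []
    for _, row in lines:
        row.sort(key=lambda g: g.x)  # left to right
        text = row[0].char
        for prev, cur in zip(row, row[1:]):
            # A horizontal gap wider than space_gap is assumed to be a space.
            if cur.x - (prev.x + prev.width) > space_gap:
                text += " "
            text += cur.char
        out.append(text)
    return out


# Letters emitted in an arbitrary drawing order, as a generator might do.
page = [
    Glyph("m", 30, 100, 6), Glyph("g", 0, 100, 6), Glyph("e", 36, 100, 6),
    Glyph("r", 6, 100, 6), Glyph("p", 18, 100, 6), Glyph("e", 12, 100, 6),
    Glyph("k", 6, 88, 6), Glyph("o", 0, 88, 6),
]

print("".join(g.char for g in page))  # stream order: mgerpeko
print(rebuild_lines(page))            # ['grep me', 'ok']
```

Searching the stream order for "grep" finds nothing; it only appears after sorting by position, and only if the gap threshold happens to suit the font and spacing.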


Agreed. I work on a solution for extracting data from arbitrary documents (invoices, payslips, you name it). For native PDFs in the wild (not scans) we gave up and just rasterize them and send them to OCR software. It's also probabilistic, but way more reliable.
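
A rough sketch of what such a rasterize-then-OCR pipeline could look like in Python. The comment above does not say which tools are used, so this assumes Poppler's `pdftoppm` and the `tesseract` CLI are installed and on the PATH; the helper names and the 300 DPI default are illustrative choices, not anything from the original setup.

```python
import subprocess
import tempfile
from pathlib import Path


def rasterize_cmd(pdf_path, out_prefix, dpi=300):
    """Build a pdftoppm command that renders each page to a PNG."""
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf_path), str(out_prefix)]


def ocr_cmd(image_path):
    """Build a tesseract command that writes recognized text to stdout."""
    return ["tesseract", str(image_path), "stdout"]


def pdf_to_text(pdf_path, dpi=300):
    """Ignore the PDF's text layer entirely: render pages, then OCR them."""
    with tempfile.TemporaryDirectory() as tmp:
        prefix = Path(tmp) / "page"
        subprocess.run(rasterize_cmd(pdf_path, prefix, dpi), check=True)
        # pdftoppm zero-pads page numbers, so a plain sort keeps page order.
        pages = sorted(Path(tmp).glob("page-*.png"))
        return "\n".join(
            subprocess.run(
                ocr_cmd(p), check=True, capture_output=True, text=True
            ).stdout
            for p in pages
        )
```

Calling `pdf_to_text("invoice.pdf")` (a placeholder filename) would return the OCR text, which can then be searched like any other string. The trade-off is the one described above: the result is still a guess, but it no longer depends on how the PDF generator laid out its glyphs.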


Spotlight can also search within PDFs


And yet it is the format documents are in. Government websites are full of PDFs.



