As usual with Tesseract it's great if you want generic OCR that gives 97-98% accuracy with little to no work, but it's never going to hit 100% or near across all fonts, diagrammatic input etc.
I have just given up on Tesseract for parsing engineering diagrams, where a single font is used throughout, and rolled my own non-ML OCR using OpenCV. It requires a little setup but easily hits 99.9% on a character basis, is way quicker, and stores far less data.
I couldn't get Tesseract to reliably pick Z against 2, I against 1 etc. when there are no serifs, no matter what I did for preprocessing - dilate/erode, threshold, border, contours with smoothing, x2 and x5 resize, all the modes and so on.
Because it's engineering diagrams, only limited contextual guidance can be used - it's basically independent characters each time.
They say retraining Tesseract won't help, but surely it would for certain fonts - what am I supposed to do about the 1 vs I then, as one example? I could pre- or post-process, but once I've set that up for a few specific cases I can do a little more, cover the whole alphabet, and get way better performance without Tesseract at all.
And then there is the stupid AutoCAD font designed for pen plotters like the HP-7585 or HP-7475 that no one owns any more, but the font with the square zero lives on, and that is a problem on its own.
Anyhoo, I tried out OCRmyPDF on a few things. It could be useful in the right circumstances, but it's by no means magic. Nanonets or pdfsumo seem to do way better, plus they can handle tables.
What a useful tool! Right now I scan my documents using VueScan, which can add an OCR layer to PDFs automatically, but that only works if I have a physical document — I am out of luck if I want to add OCR to an existing PDF. This tool looks like it can help.
I have some notary documents from when I bought a house. Searching through them manually is hell, not only because notaries use obscure language and the documents are huge, but also because they format them in a way that's safe against edits. I have longed for a way to Ctrl-F them.
Is there a recommended way to manually review and correct OCR generated like this? I've tried different methods over the years but nothing makes it easy enough, and tools that deal with hOCR all seem to be broken in different ways.
As mentioned in the other replies, Google's automatic OCR is limited. OCRmyPDF is designed for PDFs. So if you download a 1000+ page public-domain dictionary off of Archive.org (which is something I do regularly), and you want to re-run the OCR because Internet Archive doesn't tune its OCR very well for multilingual works (if at all), then OCRmyPDF is going to beat Google's automatic OCR every time.
However, I recently paid a programmer to fork OCRmyPDF to give it the option to use Google's OCR engine instead of Tesseract. That fork is here: https://github.com/ualiawan/OCRmyPDF. It's more fiddly than the regular OCRmyPDF, and it requires a Google Cloud Vision account (which charges some fraction of a cent for each page OCRed), but it works well, and in some cases may produce better results than OCRmyPDF, although you must be sure to specify the language of the document.
Very cool - I'm interested in your(?) fork of OCRmyPDF. But can I ask: why a fork and not a plugin? OCRmyPDF supports adding new OCR backends that way.
Fun fact: even Gmail OCRs image files. If you search for emails using keywords, it'll use the OCR results. I just wish there was a way to get access to the OCRed text.
Not exactly what you're asking for, but Google Keep has a "grab image text" menu item for any note that has an image in it. I assume it's using the same backend.
I think I looked at using Google Drive for OCR, but it was limited to 2 meg files or only a few pages or something? I had a lot of higher resolution pages (1k pages, 300 or 600dpi), so this was a non-starter for me.
I think you're underselling ocrmypdf, which I use heavily:
1. Scan-only PDFs are shockingly common: I download papers all the time which are scan-only.
2. Non-scan-only PDFs often have garbage OCR. I assume this is because they were done long ago and never redone since. (Tesseract has gotten a lot better over the years.) Not terribly rarely, I use ocrmypdf to forcibly redo OCR layers because they are unusable.
3. ocrmypdf supports JBIG2 and, possibly for other reasons as well, generates smaller PDFs; this is true even for 'native' PDFs or ones with good OCR. I routinely see PDFs I download - hot off the presses, just days old, presumably produced with the latest and greatest publishing stack from major scientific publishers - shrink by a third or a half, or sometimes wind up as much as 10x smaller. Not being a PDF expert, I have no idea how they manage to waste so much space, but they manage it. I also found that the standard scan tools I was using, like imagescan or gscan2pdf or 1dollarscan, were not producing PDFs as small as ocrmypdf's.
4. ocrmypdf will also write PDF/A by default. It's true that most PDFs you download or create will probably be perfectly readable 50 years from now with no special effort... But it's nice to have that extra bit of archival compliance.
I agree on all points. I use the following one-liner in directories of PDFs to reduce their file size while retaining dimensions, not hurting readability, and keeping the embedded OCR text in place. It skips re-running the OCR. It's basically a recipe from the docs, I believe.
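The exact one-liner isn't shown above; a sketch of what it might look like, based on ocrmypdf's documented `--skip-text` and `--optimize` options (the directory layout here is assumed):

```shell
# Re-compress every PDF in the current directory without re-running OCR:
# --skip-text leaves pages that already have a text layer untouched,
# --optimize 3 applies the most aggressive lossy image optimizations.
mkdir -p optimized
for f in *.pdf; do
  ocrmypdf --skip-text --optimize 3 "$f" "optimized/$f"
done
```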
Yeah this would’ve been much more exciting for me had I discovered it 5 years ago (its git log starts in 2013). But having a CLI command that I can script will be nice for the occasional project of random scanned papers, or personal documents that I don’t necessarily want in the cloud even if the OCR is free
In case anybody is looking for a reliable and free Swiss Army knife tool for PDFs, I'd like to recommend https://www.pdf24.org/en/ - it supports splitting, converting, signing, OCR, and many other operations. Both online and offline tools are available. It's not an open-source product, though, and the offline version is Windows only.
I use it together with my Nextcloud.
There is a workflow integration so that it all works automatically for uploads to a specific folder (in my case).
Example: I have a folder for invoices where I upload scans of important ones, and all of them get automatically OCRed.
I think a lot of the comments are missing the PDF in the name OCRmyPDF. Having a software package that combines several open source tools to disassemble a PDF, do cleanup on each page, run OCR on them, and then reassemble them into a new PDF with the text embedded is why OCRmyPDF is so great. I see a lot of worthy software also being mentioned in the discussion, but a lot of it requires rolling your own cleanup steps or your own reassembly steps, because the packages are designed for single-image scans or only deliver JSON output (which requires some fairy dust to get back properly into hOCR and then into a PDF). As I posted above, I got someone to add Google Cloud Vision as an alternate OCR engine for OCRmyPDF — I'd love to see more OCR engines (like ones suggested here) made possible.
I use OCRmyPDF on a regular basis to OCR journal articles my library sends me.
I've found it works great on English but (with appropriate language packs installed) works poorly on Greek and Hebrew. It also makes no effort to understand the layout of pages (e.g., tables).
The project is fantastic, though. I've often considered building a web frontend that cleans up PDFs and then OCRs them using OCRmyPDF.
I use OCRMYPDF nearly every day. It already has cleaning functions: deskewing, page rotation, despeckling, contrast, etc. The docs linked above show all the many useful options in full.
If you want to OCR a document image, modern versions of Tesseract can work well. If you last used it a few years ago, the recognition has improved since due to a new text recognition algorithm that uses modern (deep learning) techniques. Browser demo using a modern version: https://robertknight.github.io/tesseract-wasm/.
OCR processing typically consists of two major steps: detecting/locating words or lines of text on the page, and recognizing those lines of text.
Tesseract's text recognition uses modern methods, but the text detection phase is still based on classical methods involving a lot of heuristics, and you may need to experiment with various configuration variables to get the best results. As a result it can fail to detect text if you present it with something other than a reasonably clean document image.
Doctr (https://github.com/mindee/doctr) is a newer package that uses modern methods for both text detection and recognition. It is pretty new, however, and I expect it will take more time and effort to mature.
Thanks for posting. I immediately tried the browser link, and although the uploaded image has quite a decent quality, I'm not getting the results I'm looking for. Perhaps my expectations are too high?
Thanks for this test case. When I drop that image in, I see that the individual words are recognized correctly, but starting from about mid-way through they are not displayed in the correct order in the text box at the bottom. If the image is rotated so that the text baselines are horizontal (about a ~1.5 degree rotation), the words are displayed in the correct order. So it looks like smarter methods or defaults are needed for the layout analysis.
I think with modern methods it ought to be relatively easy to teach a system to predict the amount of rotation needed to straighten the image, or make the layout analysis tolerate minor rotations of the input better. Needs someone to actually implement it though!
ocrmypdf --deskew --clean-final --output-type pdf --tesseract-timeout 600 --force-ocr -l eng --jbig2-lossy --optimize 3 /Users/username/Desktop/C1jn2Kz.png.pdf /Users/username/Desktop/C1jn2Kz.out.pdf
generates a PDF with this text for me:
Bad Ul is causing people to get scammed
2022-07-08
If you've asked anybody who's tried to sell anything on Facebook Marketplace, Offerup or
Craigslist, | can guarantee you that every one of them have encountered somebody trying to
scam them. I've encountered quite a few but I'll explain how this particular scam works and how
bad UI contributes to scammers being successful.
I opened the page, didn't recognise the image you posted as the actual thing - immediately below it was a video of ice cream bars being made - and I immediately wondered how you could expect the OCR to figure that out and read it as “vanilla”. :-)
I think that’s what Evernote uses and it’s one of the features that keeps me paying year-after-year. It seems to be able to index just about anything I upload including photos of handwriting.
http://www.tobias-elze.de/pdfsandwich/ has been my tool for years to OCR old research papers and books. It uses Tesseract and some other tools to get the job done.
PDF Arranger to merge or split pdf documents and rotate, crop and rearrange their pages using an interactive and intuitive graphical interface: https://github.com/pdfarranger/pdfarranger
Did you come across any documentation that you are able to share while building yours? I'm interested in building one without relying on DL methods and comparing.
I just made it up, because I was unhappy with some of the seemingly obvious and strangely random errors from the ML that I couldn't sort out (and the size of the model data etc.), and it worked out way, way better than expected.
I tend to think some problems attacked by ML are possibly also directly solvable if the problem is understood in sufficient detail - sometimes it seems like, "we have a problem we don't fully understand (or can't explain), so we will just train a model by example".
Obviously sometimes it's just depth of indirection and degrees of freedom beyond comprehension, but then there is still the problem of explainability if you have a regulator, as in banking or medicine - how do you know, and how can you explain, exactly what it's doing, or when/how it might mislead you on certain edge cases?
Anyhoo, in this case I knew it would work, but accuracy was right up at my best hoped-for expectations on the first pass.
Basic recipe:
1) Convert PDF (to .tiff or internally to numpy), at least 300 DPI if you can.
1a) Gaussian blur and threshold as you wish, not always needed, then make an inverse (white on black) copy for contours and bounding rectangle finding.
2) Use opencv to find your bounding rects for your characters, sort them for order as the contours will not be in reading order.
3) Do a dummy run and use the bounding rects to excise individual characters and write standardised images to disk. Then select your best example of each character variant and rename them A.png, B.png etc. (You can use Tesseract to help here to save time and then hand-fix any errors, but it's just a one-off.) For maximum speed you could stash these on a ramdisk, but if you have enough RAM I'm guessing the system will cache them anyway, so maybe no point - I'm on PCIe 4.0 with 6 GB/s disk reads, so I don't really care.
4) When you want to OCR, just brute force it: get your characters into individual standardised images in turn, resizing them to match each reference bitmap exactly as you cycle through the checks (otherwise the XOR fails). Tip: it seems best to always resize the wider one to the size of the narrower one, for better accuracy. You can also discard obvious non-matches at this stage based on dimensions - e.g. the difference between an I and an M or W is obvious from the aspect ratio alone - so you can skip a few XORs if you want.
5) Pixelwise XOR the test character against each reference; if it's a match you should be left with just a thin outline where the two don't exactly align.
6) Each time, count the number of black pixels (or white, depending on whether you were inverted) in the XOR result.
7) The lowest pixel count is your match - simplistically; see below for refinements.
8) To sort out B vs 8 and I vs 1 etc, you can go two ways:
8a) Store multiple B's and 8's as references (e.g. B_0, B_1 etc.) and average, and/or look for the lowest residual XOR pixel count across multiple minor variants of the same reference character - sort of a half-arsed decision forest. Especially with scanned docs this can work well.
8b) For the known problem cases you can also do an extra sub-check - e.g. if you get an 8 or a B, do an extra pass. Focus on a relevant sub-region of the characters: for 8 vs B, the middle band (a row of plus/minus 10% around the centre) is the salient difference zone. Excise this region and run the XOR again against each likely reference; the difference in XOR pixel count will be magnified (as a ratio) and you will be able to decide this way.
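The XOR-matching core of steps 4-7, plus the aspect-ratio short-circuit from step 4, can be sketched in a few lines of numpy. This is a minimal stand-in (nearest-neighbour resizing instead of cv2.resize, and a made-up aspect-ratio threshold), not the author's actual code:

```python
import numpy as np

def resize_nearest(img, shape):
    # Crude nearest-neighbour resize via index arrays (stand-in for cv2.resize).
    rows = np.arange(shape[0]) * img.shape[0] // shape[0]
    cols = np.arange(shape[1]) * img.shape[1] // shape[1]
    return img[rows][:, cols]

def xor_score(candidate, reference):
    # Resize the wider image to the narrower one's shape, then count the
    # pixels where the two binary images disagree (the pixelwise XOR residual).
    if candidate.shape[1] > reference.shape[1]:
        candidate = resize_nearest(candidate, reference.shape)
    else:
        reference = resize_nearest(reference, candidate.shape)
    return int(np.count_nonzero(candidate ^ reference))

def classify(candidate, references):
    # Lowest XOR residual wins; skip references whose aspect ratio is
    # obviously wrong (e.g. I vs W), saving a few XORs.
    best, best_score = None, None
    cand_ar = candidate.shape[1] / candidate.shape[0]
    for label, ref in references.items():
        if abs(cand_ar - ref.shape[1] / ref.shape[0]) > 0.5:
            continue
        score = xor_score(candidate, ref)
        if best_score is None or score < best_score:
            best, best_score = label, score
    return best
```

`references` here would be the dict of standardised A.png, B.png etc. images from step 3, loaded as boolean arrays.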
When you do the bounding rects for the characters you will likely pick up residual noise, punctuation etc., especially with scanned docs - you can easily split these out for discard or special attention by testing against area (simple w * h or contour area), and/or any disproportionate ratios of w and h, or even position relative to other characters.
It works surprisingly well on fonts other than the actual reference font, to a point - obviously YMMV, but a bit of dilate and erode all around helps.
One advantage is that it scales automatically for different font sizes, so on an engineering drawing where it's typically all the same font on the sheet and the entire drawing set I can grab title block info, notes etc all for no more effort.
You can "train" specifically for special characters or fonts as needed, and it will do multiple fonts in one go, best match wins; it will just slow down a little as there are more possibilities to check for each character.
If you get the right, or nearly right, font, you can generate a checking print by writing over the top of your OCRed text in, say, red (or yellow on light blue, etc.) as you go, and then you can check a page for errors at a glance. Or XOR the found text back over the top of the original, blur, then threshold.
If you are scanning text of known form you can obviously use some regex - or in my case certain things should be a fixed number of chars, or unique, so I can check for missing characters or double-ups etc. to flag. But typically I have found there are just a few specific problem confusions, and they can be addressed as above as needed, and then it's near 100%.
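For instance, if tags on the drawings follow a known pattern, the validation pass is trivial (the tag format here is purely hypothetical):

```python
import re

# Hypothetical tag format: two uppercase letters, a dash, four digits.
TAG_RE = re.compile(r"[A-Z]{2}-\d{4}")

def flag_suspect_tags(tags):
    # Return OCR results that don't fit the expected form - classic
    # confusions like O vs 0 or I vs 1 show up here immediately.
    return [t for t in tags if not TAG_RE.fullmatch(t)]
```

Anything flagged can then be re-run through the targeted sub-region checks described above, or just queued for a human glance.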
So that's it - maybe you can come up with some more improvements or automation of the setup.
I was originally going to look at varying the blur of the reference chars, or minor variations to aspect ratios or positioning in the character frame, applied over multiple runs as a sort of sensitivity analysis to home in on the best matches or problem areas, but found that even with scanned documents I had no need, so I never bothered.
While dreaming initially I also considered some sort of algebra/trigonometry/stats on the contours for problem characters - both vector angle and enclosed area - but once again had no need. Things like looking at the ratios of types of contour vectors for a single character, e.g. straight vs curved (a C is all curved, a T is almost all straight) by comparing adjacent contour vectors, plus ratios of area to circumference - but once again, no need in the end.
Finally, you could also use the OpenCV blob matching. I never got that far for OCR use as I have yet to need it, but I do use it for finding regions of interest, since I am doing engineering diagrams - e.g. P&IDs, which are highly coded symbolically, and I am mostly interested in symbols, then the tags enclosed (alpha + numeric chars) and the modifying characters next to symbols.
It sounds like a fair bit of work now that I write it down, but really, using Python at a semi-experienced level and not really having used OpenCV much before, it was maybe a lazy weekend and a few nights to have something pretty solid for my own use - but still too shameful in code quality to publish.
Luckily the only code review involved will be a self review between me and the dog...
Once it's set up for a standard font, or generic fonts, it's done, so the ROI goes up with use over time. The only other thing I might yet do is try to integrate it with right-click for a selected window area, into the clipboard, but I'm not sure it's worth it compared to already available options.
This is amazing - thank you so much for sharing this, people like you who go to great depths for random internet users make this world a much better place.
I'm going to attempt to use this methodology on a pet-project for components on PCBs.
Do they edit the pdfs, or do they index them so you can search? OCRmyPDF does the first, so it's usable elsewhere; glancing around, it looks like Dropbox does the second, which can't be saved for use elsewhere.
A quick note that many government agencies in many countries like to avoid complying with the spirit of transparency laws by posting scan-only PDF files. Batch OCR is an incredibly valuable tool for activists.
A related option: LibreOffice has a cool option where you can generate the PDF with the editable text format embedded. You get a clean PDF that is also fully editable. Easy tech, but also useful.
I think Abbyy kills it, they really are Enterprise grade, unlike Adobe who like to claim they are but crash more often than Taki Inoue ever did.
For standard grade I find masterpdf kills Adobe almost all around, including OCR and search/index, and it lets you edit PDF elements more easily than editing a Word document. Not FOSS, but I found it well worth my $50 or whatever it was, especially compared to Adobe and Foxit. It crashes way, way less than Adobe as well, and you can get a Linux version, which you can't with Adobe.
I have a full Adobe DC license supplied by work, but the only time I use Adobe by choice is when I want the outrageously useful and uniquely packaged functionality provided by the AutoBookmark (Evermap) or Debenu plugins - though tbh they only really do things Adobe should have baked in anyway, if they weren't trying to gouge you on basic functionality they lock up (e.g. create bookmarks based on a font, page position and a regex, and then make a TOC from the bookmarks - lets you create a TOC for a PDF that has none and where you don't have the native file, for instance).
I've been doing a deep dive in to pdf format for some time now.
I'll check it out. I haven't tried Abbyy yet but have tried a few others, and none of them unfortunately came close to the capabilities of Adobe Acrobat. I don't like Adobe Acrobat because I primarily use Linux, but do keep a Windows VM or partition around to scan documents.
Really!? I have found Acrobat to be pretty awful but I work with complex documents (lots of columns, weird forms, etc).
I have found ABBYY to be the gold standard. Thankfully, that's also the engine built into a number of other bits of software, including DevonThink and some PDF readers.