Forum Moderators: Robert Charlton & goodroi
In the past, scanned documents were rarely included in search results as we couldn't be sure of their content. We had occasional clues from references to the document-- so you might get a search result with a title but no snippet highlighting your query. Today, that changes. We are now able to perform OCR on any scanned documents that we find stored in Adobe's PDF format.
For example, some major sites put boilerplate disclaimers into an image to avoid indexing problems. I hope OCR doesn't complicate things.
I'm still not about to recommend images instead of text documents, but as a searcher it might give me access to information I've been missing.
I confess to being somewhat underwhelmed by Google's effort to index "non standard" web content. I dread to think how many times I've searched Google, but in all that time I still shy away from clicking on stuff like Word documents or PDFs. If I'm explicitly looking for that content then it's great, but I'll be looking for [filetype:pdf] [google.com] by that point ;)
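For anyone who wants to script that kind of PDF-only lookup, here is a minimal sketch. It assumes the standard `https://www.google.com/search?q=` query URL, and the helper name `pdf_search_url` is made up for illustration. All it does is append the `filetype:pdf` operator and URL-encode the query.

```python
from urllib.parse import urlencode


def pdf_search_url(terms: str) -> str:
    """Build a Google search URL restricted to PDF results.

    Hypothetical helper: it appends the filetype:pdf operator to the
    search terms and URL-encodes the whole query string.
    """
    query = f"{terms} filetype:pdf"
    return "https://www.google.com/search?" + urlencode({"q": query})


if __name__ == "__main__":
    print(pdf_search_url("ocr accuracy"))
```

Paste the printed URL into a browser; the colon in the operator is percent-encoded, which the search box handles the same way as typing it by hand.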
Jason Kincaid, a blogger at TechCrunch, noted that: "Such technology has existed for quite a while, but accuracy has always been an issue -- and the fact that Google is doing it on such a massive scale makes it a very impressive accomplishment. It also opens the doors to much more thorough searching, especially for content that is often found in printed documents (like academic papers)."
ComputerWorld Article [computerworld.com]
As the linked articles indicate (and your own experience can verify), accuracy in OCR is still a difficult problem. The Google results from this new adventure are most likely not going to be ideal for quite a while. If you don't want mismatched information, or coffee stains being turned into text, then make sure you take some helpful steps.
Muttering...
[catalogs.google.com...]
I hope this works better than the OCR I've become accustomed to.
They've been using this for Google Books, which are images presented in PDF format, for quite some time. Like any OCR, accuracy depends a lot on the font and the text. If the text contains standard dictionary words in a relatively clean font (not old paper with strong serif fonts) the results are decent. Otherwise, poor.
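A rough way to see the "dictionary words" effect on your own scans is to measure what fraction of the OCR output's tokens are real words. The sketch below is only an illustration: the word list is a tiny stand-in for a real dictionary, the function name is invented, and nothing here reflects how Google scores its own OCR.

```python
import re

# Hypothetical, tiny word list; a real check would load a full dictionary.
KNOWN_WORDS = {"the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"}


def dictionary_hit_rate(ocr_text: str, known_words=KNOWN_WORDS) -> float:
    """Return the fraction of alphabetic tokens found in known_words."""
    # Digits and punctuation split tokens, so "qu1ck" becomes "qu" and "ck".
    tokens = re.findall(r"[a-z]+", ocr_text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in known_words)
    return hits / len(tokens)


if __name__ == "__main__":
    clean = "The quick brown fox jumps over the lazy dog"
    noisy = "Tbe qu1ck brovvn fox jurnps ovcr the 1azy dog"
    print(dictionary_hit_rate(clean))
    print(dictionary_hit_rate(noisy))
```

A clean scan scores near 1.0, while typical OCR slips (1 for l, rn for m, vv for w) drag the score down quickly, which is a cheap signal that a document needs rescanning or manual correction.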
I'm curious that this is news, because PDF scans from Google Books have been text-searchable for a long time and you have been able to view the text version for at least a year. It is these text versions that have formed the basis of the book SERPs and, with universal search putting book results in the general results, OCR'd text from image scans has been showing up in the general search results for quite a while too.
I guess this announcement means that they're expanding that usage beyond Google Books to documents discovered "in the wild"? Or is this something different?
I offer several PDFs, put in that format for a REASON. Just checked Google and see that for two of those they offer View as HTML for those titles. Looks like farmer's friend. Worse, the "title" is totally fubarred...and was used for the PDF listing! Not liking this at all.
Nor am I, tangor!
I have many files that could easily be viewed as text or HTML, were that my original intent. That I am at least offering a viewing option in PDF format is not something I should have to explain or justify to Google or any other bot.
Has anybody seen their password-encrypted PDFs being OCR'd by Google and listed in the SERPs?