OCR Files and Columns
Optical Character Recognition (OCR) software is an emerging and powerful technology that can, when used with proper care, be an immensely helpful research tool. Researchers can take images of typeset documents and convert them into searchable text. If the quality of the OCR is high enough, researchers can even copy and paste the text into their notes, or quote it in their publications.
I’ve been running OCR on several images that I’m using for another project, The Confederation Debates. Much of the text for this project is being reproduced, with permission, from the Canadian Parliamentary Historical Resources website. When I ran a test page through several OCR programs, I noticed a persistent problem that stemmed from my viewing program (rather than the OCR software).
Not all PDF viewing programs are created equal, particularly when it comes to columns. I personally work on a Mac, and I love several features of Preview (the default PDF viewer for OS X). Preview, however, has difficulty with columns. Take for example the image from the House of Commons debate on the creation of Alberta and Saskatchewan in 1905. When I select multiple lines of text in Preview, it often reads straight across the page rather than recognizing that the paragraph only occupies half of the page’s width (see image below).
With Preview, I can still search for keywords, but it is very tedious to copy and paste individual lines correctly.
But look what happens when I open the exact same file in Acrobat Reader (below).
Highlighting a single column and copying whatever text I like is easy. Again, this is the SAME file. Acrobat, I should note, made the same error as Preview when I tested additional pages. Preview, moreover, sometimes handled columns well. But Acrobat is, on average, definitely better than Preview.
The two programs apparently read the text layer differently. As a result, I keep both programs handy and use them according to their strengths.
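A toy sketch in Python can illustrate how the same text layer might produce two different selections. It is only a guess at the kind of logic involved, not a description of how Preview or Acrobat actually work. The Word type, the coordinates, the sample words, and the gutter_x parameter are all made up for illustration.

```python
from typing import NamedTuple


class Word(NamedTuple):
    """A hypothetical word from a PDF text layer, with its position on the page."""
    text: str
    x: float  # left edge of the word
    y: float  # top of the word's line (smaller = higher on the page)


def read_across(words):
    """Order words line by line across the full page width, ignoring columns."""
    return [w.text for w in sorted(words, key=lambda w: (w.y, w.x))]


def read_by_column(words, gutter_x):
    """Read everything left of gutter_x top to bottom, then everything right of it."""
    left = [w for w in words if w.x < gutter_x]
    right = [w for w in words if w.x >= gutter_x]
    return read_across(left) + read_across(right)


# An invented two-column page with two lines in each column.
page = [
    Word("The", 10, 0), Word("House", 40, 0),
    Word("Mr.", 310, 0), Word("Speaker", 340, 0),
    Word("met", 10, 20), Word("today", 40, 20),
    Word("rose", 310, 20), Word("next", 340, 20),
]

# Straight across the page: the two columns get interleaved.
print(" ".join(read_across(page)))

# Column by column: each paragraph stays intact.
print(" ".join(read_by_column(page, gutter_x=300)))
```

The first print mixes the two columns together line by line, which resembles the selection problem described above; the second keeps each column together. A real viewer would have to detect the gutter on its own rather than being handed it, which may be part of why results vary from page to page.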