OCR Files and Columns

14:13 19 January in Digital Research Tricks

Optical Character Recognition (OCR) software is an emerging and powerful technology that, when used with proper care, can be an immensely helpful research tool. Researchers can take images of typeset documents and convert them into searchable text. If the quality of the OCR is high enough, researchers can even copy and paste the text into their notes, or quote it in their publications.

I’ve been running OCR on several images that I’m using for another project, The Confederation Debates. Much of the text for this project is being reproduced, with permission, from the Canadian Parliamentary Historical Resources website. When I ran a test page through several OCR programs, I noticed a persistent problem that stemmed from my viewing program (rather than the OCR software).

Not all PDF viewing programs are created equal, particularly when it comes to columns. I personally work on a Mac, and I love several features of Preview (the default PDF viewer for OS X). Preview, however, has difficulty with columns. Take, for example, the image from the House of Commons debate on the creation of Alberta and Saskatchewan in 1905. When I select multiple lines of text in Preview, it often reads straight across the page rather than recognizing that the paragraph only occupies half of the page’s width (see image below).

OCR sample page - Preview

With Preview, I can still search for keywords, but it is very tedious to copy and paste individual lines correctly.

But look what happens when I open the exact same file in Acrobat Reader (below).

OCR sample page - Acrobat Reader

Highlighting a single column and copying whatever text I like is easy. Again, this is the SAME file. I should note that Acrobat made the same error as Preview on some of the additional pages I tested, and that Preview sometimes handled columns well. On average, though, Acrobat is definitely better than Preview.

The two programs apparently read the text layer differently. As a result, I keep both programs handy and use them according to their strengths.
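To make the difference concrete, here is a minimal Python sketch of two ways a viewer could order the same words from a two-column page. It is only an illustration: the word positions, the sample text, the function names, and the `split_x` column boundary are all made up for this example, and I am not claiming this is how Preview or Acrobat actually work internally.

```python
# Hypothetical word boxes from a two-column page: (x, y, text).
# x grows to the right, y grows down the page. All values are invented.
WORDS = [
    (50, 100, "The"), (90, 100, "motion"),
    (350, 100, "Mr."), (390, 100, "Speaker"),
    (50, 120, "was"), (90, 120, "carried."),
    (350, 120, "rose"), (390, 120, "next."),
]


def read_across(words):
    """Naive order: top to bottom, then left to right across the full page."""
    ordered = sorted(words, key=lambda w: (w[1], w[0]))
    return " ".join(w[2] for w in ordered)


def read_by_column(words, split_x):
    """Column-aware order: finish the left column before the right one.

    split_x is an assumed x position of the gutter between the columns.
    """
    left = [w for w in words if w[0] < split_x]
    right = [w for w in words if w[0] >= split_x]
    return read_across(left) + " " + read_across(right)


print(read_across(WORDS))
# The motion Mr. Speaker was carried. rose next.
print(read_by_column(WORDS, split_x=300))
# The motion was carried. Mr. Speaker rose next.
```

The first result jumbles the two columns together line by line, which is the kind of output I get when a selection runs straight across the page. The second keeps each column intact.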


Dan Heidt

Dan obtained his history PhD from Western University in 2014, and is currently a postdoctoral fellow at Trent University. His studies led him to develop new and efficient techniques to economize academic research. As a co-founder of WI he continues to hone photography, organizing and analyzing strategies for social science, humanities, and science researchers. Over the years, he has worked with or co-supervised half a dozen Research Assistants. His research to date has taken him to over a dozen archives located across Canada, the United States and England.
