Digitizing the 1859 Frederick Md. Directory

Converting the 1859 Frederick, Md. directory to a searchable PDF document was a learning experience that I would like to share. Photographing the directory went fairly quickly. There were ninety-two images, each showing two facing pages, plus images of the front and rear covers. Working alone, I completed this step in under an hour. Using the hand-held remote “clicker” was a big time saver. I could just turn a page, hold it down with my finger “tool”, and click.

I saved the original Nikon Raw files in TIFF format and converted the TIFF files to searchable PDF with Nuance Omnipage Professional 16. I happen to own this rather pricey application because I took advantage of a deep discount offer a year or two ago. Omnipage Standard 18 may be a reasonable alternative. I wanted the PDF files to be in Image Over Text format, which displays the original image with searchable text underneath.

On my first attempt, I ran into a rather frustrating problem trying to save in that format. I selected the “PDF Searchable Image” format, but Omnipage saved in converted text format with no background image. I was a novice with this application and looked everywhere for a solution with no success. Finally, I opened the “Options” dialog for the “PDF Searchable Image” format choice. It contains two check boxes, “Show Background Image Layer” and “Show Text Layer”, and both must be checked. This solved the problem.

With this basic function fixed, the next step was to perform Optical Character Recognition (OCR) on the TIFF image files. I chose to process them one file at a time and save the output as separate PDF files. (I merged the files later.) I chose not to let Omnipage locate the text and images automatically and instead created the text recognition boxes manually. I did not need to deal with images because I used the Image Over Text format. This is all explained in the online help. New users should expect to spend some time overcoming the rather steep learning curve. This effort, however, was rewarded by good conversion accuracy. By contrast, I made a brief attempt to OCR the document using Adobe Acrobat 8 Professional without any early success. After performing the OCR, Omnipage pops up a “Proofreader” box with “Recognition Suspects” highlighted. You can specify the correct text behind suspect graphic images. There is a “training” function that remembers your identification.
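For readers without Omnipage, the one-file-at-a-time workflow can be roughed out with the free Tesseract command-line tool, which also writes searchable image-over-text PDFs. This is not what I used, and it does not reproduce the manual recognition boxes or the Proofreader step. It is only a minimal sketch in Python that assumes Tesseract is installed and that the TIFF files sit in a folder I have called `scans` (the folder names and the function name are made up for illustration). The script only builds the commands; it does not run them.

```python
from pathlib import Path


def plan_ocr_jobs(tiff_dir, pdf_dir):
    """Return one Tesseract command per TIFF file, in file-name order.

    Nothing is executed here; each command is a list of arguments that
    could be handed to subprocess.run() once the plan looks right.
    """
    tiff_dir, pdf_dir = Path(tiff_dir), Path(pdf_dir)
    jobs = []
    # Only files ending in ".tif" are picked up; adjust the pattern if
    # your files use ".tiff" or upper-case extensions.
    for tiff in sorted(tiff_dir.glob("*.tif")):
        # Tesseract adds the ".pdf" extension to the output base itself.
        out_base = pdf_dir / tiff.stem
        jobs.append(["tesseract", str(tiff), str(out_base), "pdf"])
    return jobs


if __name__ == "__main__":
    # "scans" and "ocr_output" are hypothetical folder names.
    for command in plan_ocr_jobs("scans", "ocr_output"):
        print(" ".join(command))
```

Keeping one PDF per image, as I did, makes it easy to redo a single bad page without reprocessing the whole directory.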

I had a problem with areas like this one (image: Suspect Text Area). The large “B” is called a “drop cap”. Drop caps were repeatedly not recognized, and the training function did not seem to work for them. If anyone has a solution, I would love to hear from them, as this was the primary source of error.

I finished processing by merging the individual PDF files using Adobe Acrobat 8 Professional.  The finished searchable document is now available in the archives of the Historical Society of Frederick County. There are enough directories and similar documents there to keep me busy for some time.
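One thing to watch when merging separately saved files is the order. A plain alphabetical sort puts `page10.pdf` ahead of `page2.pdf`. The short Python sketch below shows one way to sort file names so the numbers are compared as numbers. The file names are hypothetical, and the sorted list would still have to be handed to whatever merge tool you use.

```python
import re


def natural_key(name):
    """Split a name into text and number parts so that 2 sorts before 10."""
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", name)]


def merge_order(pdf_names):
    """Return the per-page PDF names in the order they should be merged."""
    return sorted(pdf_names, key=natural_key)


if __name__ == "__main__":
    # Hypothetical file names, listed out of order on purpose.
    names = ["page10.pdf", "page2.pdf", "page1.pdf"]
    print(merge_order(names))  # ['page1.pdf', 'page2.pdf', 'page10.pdf']
```

Padding the numbers with leading zeros when saving (for example `page001.pdf`) avoids the problem altogether.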

Now that I can extract the text from directories, I would like to include the directories in web-searchable documents. I am currently studying methods of accomplishing this goal.
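I have not settled on an approach, but one possible starting point is a simple word index that maps each word to the pages where it appears. The Python sketch below is only an illustration under that assumption. The function names are my own, and the two sample entries are invented for the example, not taken from the directory.

```python
import re
from collections import defaultdict


def build_index(pages):
    """Map each lower-case word to the sorted page numbers containing it.

    `pages` is a dict of page number -> OCR text for that page.
    """
    index = defaultdict(set)
    for page_number, text in pages.items():
        for word in re.findall(r"[a-z]+", text.lower()):
            index[word].add(page_number)
    return {word: sorted(numbers) for word, numbers in index.items()}


def search(index, query):
    """Return the pages that contain every word in the query."""
    words = re.findall(r"[a-z]+", query.lower())
    if not words:
        return []
    hits = set(index.get(words[0], []))
    for word in words[1:]:
        hits &= set(index.get(word, []))
    return sorted(hits)


if __name__ == "__main__":
    # Invented sample text standing in for two pages of OCR output.
    pages = {
        1: "Baker John, carpenter, Market st",
        2: "Brown Mary, milliner, Patrick st",
    }
    index = build_index(pages)
    print(search(index, "market"))  # [1]
    print(search(index, "st"))      # [1, 2]
```

An index like this is only as good as the OCR text behind it, so misread drop caps would also show up as missed search hits.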

About John

John Reynolds (aka PixTraveler) worked in the satellite communication business as a physicist and later in the IT world as a software engineer. He is now free to be a grandfather, travel, volunteer, read, take courses, and work as an independent software developer. He is working on PixSafari, an application that makes it easy to geolocate and map photographs. His other interests include American history and photography. You can visit his website and blog at reynsoft.com.