Reading and Extracting Text from Adobe Files

The sighted world generally loves, or at least tolerates, PDF files. Blind users, though, can have problems reading PDF files or getting plain text out of them. Find out how you can make PDF files easier to work with.

First of all, we wouldn't have PDF files if it weren't for Adobe and its Reader software. If you happen to be looking for the latest, and purportedly greatest, copy of Adobe Reader, just follow this link to get Adobe Reader. This release looks as if it will have a positive impact on access technology's ability to interpret PDF files.

The guys at Adobe have created a guide to using Adobe files with screen-reading technology. You can get the Adobe file here, or the text version of the guide here. Just remember that the text-only version of the guide will not look right in some areas, such as the table of contents, because formatting was lost in the conversion process.

What's new in Adobe Reader 9.0?

The blurb below is taken directly from Adobe's what's new page for accessibility. Take a look. You may want to upgrade to the latest version of Adobe Reader before trying some of the ideas in this article.

Adobe Acrobat 9 software includes new accessibility features that improve the performance of Acrobat when it interacts with assistive technologies such as screen readers.

New accessibility features available in Acrobat 9 include:

  • Table Editor tool in the TouchUp Read Order panel that facilitates the evaluation and repair of PDF tables for accessibility. The Table Editor tool features a user interface that provides users with an immediate indication of the scope and role of cells in tables within PDF files.
  • TouchUp Read Order tool improvements that enhance support for editing multiline text blocks, and new Unicode-based text editing that in addition to the ISO Latin character set now provides direct support for Central European languages, with more to follow.
  • Improved support for dynamic PDF forms that have been created using Adobe LiveCycle Designer ES software. As tables render themselves according to the responses provided by users, the document's dynamic rendering is communicated to assistive technology.

Adobe and Screen Readers

Many still say that accessing PDF files is a pain. This remains true even though screen readers such as System Access, JAWS, and Window-Eyes have made PDF documents accessible for the most part. Adobe is also committed to accessible PDF documents: Adobe Reader automatically detects whether a screen reader is in use and runs an accessibility wizard that helps users set up Adobe Reader for maximum compatibility with the screen reader of their choice.

Despite these laudable efforts, many still prefer not to deal with Adobe Reader or PDF files at all, and that is precisely what the next section is for.

Getting Text from PDF Files

Finally! We get to the good stuff! One of the most direct ways to get text out of a PDF file is with Adobe Reader itself. If the security settings specified by the author of the PDF file allow saving to text, you can bring up the File menu with Alt+F, then arrow up to “Save as text”. You will then get a regular Save As dialog box in which you can choose a location on your hard drive and type in a file name of your choice. As an example, you could type something like c:\text\searching Google more Effectively.txt in the file name box, and then press Enter. If you have no other software available, this can be one of the easier ways to extract text from a PDF file. What you get is a no-frills plain text file. Of course, you can then open this file in Notepad or any word processing application. But wait! That’s not the end of the story…

Other Ways to Extract Text from PDF Files

One of my personal favorite free programs is called EdSharp, by Jamal Mazrui. One of the great things about EdSharp is its ability to instantly convert many PDF files to plain text just by pressing Enter on the PDF file! As you probably know, pressing Enter on a PDF file generally opens Adobe Reader. To change that, move to the file, press the Applications key, and arrow down to “Open with”. Then arrow down to either “choose default program” or “choose program”, depending on the version of Windows that you are running. Tab over to the Browse button, navigate to the EdSharp folder and then to the EdSharp executable, and press Enter. If you press Enter one more time, the computer will open that PDF file with EdSharp. Before you do, though, you may want to check the box that says “Always open files of this file type with this program”. If you check this box with the space bar, from then on you can simply arrow to any PDF file, press Enter, and have EdSharp convert it into text.

Once EdSharp has done its work, you get simple text that you can arrow through and save just as you would a regular Notepad document. Unfortunately, this will not work on every PDF file. But don’t worry! There are a few other programs that you can use to try to open more difficult PDF files. Before I get off the subject of EdSharp, though, one more really handy thing is that it can treat PowerPoint files the same way it treats PDF files. So you can get plain text versions of PowerPoint files as well.

PDF to Text

This program is also free and very accessible. As with EdSharp, Jamal Mazrui is its creator. The big difference between PDF to Text and EdSharp is that PDF to Text can convert a whole batch of PDF files to text at one time. For example, say you had a folder with 100 PDF files in it. You could just open PDF to Text, browse to the folder with the PDF files, indicate where on your hard drive you want the text files to be saved, tab to the Convert button, and go get some coffee. The process could take a while if the PDF files are large or if you have a lot of them. PDF to Text can also show you a PDF file in a read-only text box, and it can optionally do OCR (optical character recognition) on a PDF file if needed.
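
For readers who like to script things, the batch idea described above (walk a folder, turn every PDF into a text file in another folder) can be sketched in a few lines of Python. This is only an illustration of the general approach, not how PDF to Text itself works. The names plan_batch, convert_folder, and extract_text are made up for this example, and extract_text is a placeholder you would have to supply yourself, for instance a function built on a PDF library or on a command-line converter you already have installed.

```python
from pathlib import Path
from typing import Callable, List, Tuple


def plan_batch(source_dir: Path, output_dir: Path) -> List[Tuple[Path, Path]]:
    """Pair every PDF in source_dir with the text file it should become."""
    pdfs = sorted(
        p for p in source_dir.iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
    return [(pdf, output_dir / (pdf.stem + ".txt")) for pdf in pdfs]


def convert_folder(
    source_dir: Path,
    output_dir: Path,
    extract_text: Callable[[Path], str],
) -> List[Path]:
    """Convert each PDF in source_dir and return the text files written.

    extract_text is a hypothetical placeholder: it takes the path of a
    PDF file and returns that file's text as a string.
    """
    output_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for pdf, target in plan_batch(source_dir, output_dir):
        text = extract_text(pdf)
        if not text.strip():
            # Nothing came out. The file may be a scanned image or be
            # locked, so it is skipped here and would need OCR or
            # another tool.
            continue
        target.write_text(text, encoding="utf-8")
        written.append(target)
    return written
```

Files that yield no text are skipped rather than saved as empty documents, which makes it easy to see afterwards which PDFs still need OCR or one of the other programs mentioned in this article.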

These free programs aren’t working?

Then you may need something with a bit more PDF-busting power! There are two affordable alternatives. These programs are not free, but they have saved the day for me when I really needed to get some text out of a locked PDF document.

If you visit Reading Made Easy, you will find PDF Magic among other great software. This program can convert PDF files to several different formats, such as text, RTF, and HTML. It can also retain formatting such as bold text, headers, and the like. PDF Magic can also do optical character recognition, and it does this automatically for you.

PDF Equalizer is similar to PDF Magic and is made by the same company, Reading Made Easy. The big difference between the two is that PDF Equalizer does not do any conversions. In other words, PDF Equalizer is a viewer. You can read the document in it, but that is about it. The program will not convert, even to text. That may be OK for you, though. At least now you have some choices when it comes to using PDF documents.

Please feel free to comment, either via the blog or via email at Help.