PDF to TXT

I’ve procrastinated for so long. Let me write about what I’ve done so far.

My first step was to convert the SANDAG public comments from PDF to CSV. Although there are some open-source tools for PDF-to-CSV conversion, the file I was converting exceeded their file size limits. So I converted the PDF to TXT first, and then wrote Python code to convert the TXT file to CSV.
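As a rough illustration of the txt-to-csv step (not the actual script), here is a minimal Python 3 sketch. It assumes that each comment in the exported text is separated from the next by one or more blank lines, which may not match the real SANDAG export. The function names `split_records` and `records_to_csv` and the column names `id` and `comment` are made up for this example.

```python
import csv
import io
import re


def split_records(text):
    """Split raw text into records separated by one or more blank lines.

    Assumption: a blank line marks the boundary between two comments.
    """
    blocks = re.split(r"\n\s*\n", text.strip())
    # Collapse the line breaks inside each block into single spaces.
    return [" ".join(block.split()) for block in blocks if block.strip()]


def records_to_csv(records, out_file):
    """Write one row per record: a running id plus the comment text."""
    writer = csv.writer(out_file)
    writer.writerow(["id", "comment"])  # hypothetical column names
    for i, record in enumerate(records, start=1):
        writer.writerow([i, record])


if __name__ == "__main__":
    sample = "First comment,\nwrapped over two lines.\n\n\nSecond comment."
    buffer = io.StringIO()
    records_to_csv(split_records(sample), buffer)
    print(buffer.getvalue())
```

The `csv` module takes care of quoting, so comments that contain commas or quotation marks still end up in a single cell. For a real file, `out_file` would be something like `open("comments.csv", "w", newline="")` instead of the in-memory buffer.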

Since there was only one file, I simply used Adobe’s PDF-to-TXT function.

Later, I had another task to convert meeting minutes from pdf to txt, but this time instead of converting one big document, I had to convert several hundred small documents.

I followed the steps on my classmate’s blog. She has written a very detailed description.

  1. Install Python 2.x.
  2. Download pdfMiner from this link. It contains pdf2txt.py, which extracts text contents, as well as dumppdf.py, which dumps the internal contents of the PDF, including images, in pseudo-XML format. I only used pdf2txt.py for my tasks. Here’s the index page for pdfMiner.
  3. Extract and install, then test with the sample that comes with the package:
    $ python setup.py install
    $ pdf2txt.py samples/simple1.pdf
    Hello
    
    World
    
    Hello
    
    World
    
    H e l l o
    
    W o r l d
    
    H e l l o
    
    W o r l d

    You may need to change into the extracted pdfMiner directory first. Also, for those new to Python, I’m using Anaconda as my platform and Sublime as my editor.

  4. Now, put everything you want to convert into one directory, and write a .bat file. Mine looks like this:
    cd /d "C:\My Programs\Python\NLP\PDF Converting"
    
    @rem optional: add "cmd /k" in front to keep terminal window open
    @rem only loop over .pdf files; quotes keep file names with spaces intact
    for %%f in (.\executive_minutes\*.pdf) do ^
    python pdfminer-master\tools\pdf2txt.py ^
    -o "executive_minutes\txt\%%~nf.txt" "executive_minutes\%%~nf.pdf"

    The first line switches drives and changes to the directory that holds both the pdfMiner directory “pdfminer-master” and the files directory “executive_minutes”. “^” is the line-continuation character in batch files, and “~n” expands to the file name without its extension or directory. This Stack Overflow answer contains a more complete list of ways to iterate over all files, subdirectories, etc. in a directory in batch. Lastly, I’m only using the -o output option of pdf2txt.py, but a complete list of options can be found here.

  5. Run the .bat file. And wait for your fruit! Yay!
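
For anyone not on Windows, the same loop can be sketched in Python instead of batch. This is only a sketch under a few assumptions: it expects the same layout as above (“pdfminer-master” and “executive_minutes” side by side), it uses only the -o option of pdf2txt.py mentioned in this post, and it launches pdf2txt.py with whichever Python interpreter runs the script, so that interpreter needs to be one pdfMiner works with. The function names `build_jobs` and `convert_all` are made up for this example.

```python
import os
import subprocess
import sys


def build_jobs(pdf_dir, txt_dir, script):
    """Return one pdf2txt.py command per PDF found directly in pdf_dir."""
    jobs = []
    for name in sorted(os.listdir(pdf_dir)):
        # splitext gives the name without extension, like %%~nf in batch.
        stem, ext = os.path.splitext(name)
        if ext.lower() != ".pdf":
            continue  # skip non-PDF files and subdirectories such as txt
        src = os.path.join(pdf_dir, name)
        dst = os.path.join(txt_dir, stem + ".txt")
        jobs.append([sys.executable, script, "-o", dst, src])
    return jobs


def convert_all(pdf_dir, txt_dir, script):
    """Run every command in turn; raises on the first failing conversion."""
    if not os.path.isdir(txt_dir):
        os.makedirs(txt_dir)
    for cmd in build_jobs(pdf_dir, txt_dir, script):
        subprocess.check_call(cmd)


# Example call, mirroring the .bat file (paths are the ones used above):
# convert_all("executive_minutes",
#             os.path.join("executive_minutes", "txt"),
#             os.path.join("pdfminer-master", "tools", "pdf2txt.py"))
```

Passing each command as a list means file names with spaces need no extra quoting, which is the part that is easiest to get wrong in the batch version.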

*Another reference blog post to look at for manipulating PDF docs.

*Masha also shows how to use pdfMiner from within Python on her blog.
