Import PDFs into Drupal

I have about 19,000 PDFs to get into Drupal. Yeah, fun. There aren’t any easy ways to do it, and there don’t appear to be any modules that support this. So I had to get a little creative, and I have made it work.

Biggest caveat: this doesn’t bring a nice-looking PDF into Drupal with all the wonderful PDF formatting preserved. It really will just bring in the textual content, which suits my needs because the text is all I really need. This isn’t a great solution, and it may be the biggest one-off in my career, but if you need what the PDF says, and not how it looks, this will work for you.

One of the issues here is that the PDFs have a lot of weird formatting in them. Many are actually scans of decades-old paper court documents. They wind up with all sorts of page breaks, table formatting and other oddities.

Here is a high-level overview of the process:

  • MS Windows as the OS of the client doing all this
  • Convert PDFs into DOCs with the great utility from boxoft.com *
  • Use MS Word to remove faulty formatting
  • MS Outlook to email the docs in the body of the email
  • Hotmail for the email transport to an IMAP mailbox on QMail **
  • Mailhandler Drupal module to receive the doc
  • Automate the process with Macro Recorder from http://www.jitbit.com ***

* Freeware
** Can be any Mailhandler-enabled mailbox
*** Not free, but a great product with a generous 40-day trial
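For anyone who would rather script the cleanup step than do it in MS Word, here is a minimal sketch in Python. This is not what I actually ran; it assumes you already have the document text as a plain string, and the function name `clean_extracted_text` is made up for illustration. It only handles the oddities mentioned above (page breaks and stray table spacing).

```python
import re

def clean_extracted_text(raw: str) -> str:
    """Normalize text pulled out of a converted PDF (illustrative helper)."""
    # Form feeds usually mark page breaks; turn them into paragraph breaks.
    text = raw.replace("\f", "\n\n")
    # Normalize Windows and old Mac line endings to plain newlines.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Collapse runs of spaces and tabs, common where table columns were flattened.
    text = re.sub(r"[ \t]+", " ", text)
    # Trim leading and trailing spaces on every line.
    text = "\n".join(line.strip() for line in text.split("\n"))
    # Squeeze three or more newlines down to a single blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()

if __name__ == "__main__":
    sample = "Page one  text\t here\r\n\fPage two\n\n\n\nEnd "
    print(clean_extracted_text(sample))
```

Scanned court documents will have plenty of problems this does not catch, so treat it as a starting point to refine against your own files.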

I’m using MS products for good reasons. I did try to make this work with Open Office, but it doesn’t have the features that I need.

MS Word, used with Outlook and Hotmail, will allow you to easily send the doc in the body of the email, not as an attachment. I looked for attachment handling but I didn’t really see any. Once the document has been sent in the body, it essentially loses its MS Word attributes and becomes simply formatted text. So, it can be easily processed by Mailhandler.
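The same idea, text in the body rather than an attachment, can be sketched with Python’s standard `email` library. This is an illustration and not the Word/Outlook route I used; the addresses and the helper name `build_body_email` are placeholders, and it assumes your Mailhandler setup takes the subject as the node title.

```python
from email.message import EmailMessage

def build_body_email(sender, recipient, title, body_text):
    """Build a plain-text email whose body is the document text (no attachment)."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    # Assumption: the receiving mailbox is configured so the subject becomes the title.
    msg["Subject"] = title
    # set_content with a str produces a single text/plain part.
    msg.set_content(body_text)
    return msg

if __name__ == "__main__":
    msg = build_body_email(
        "me@example.com",        # placeholder sender
        "import@example.com",    # placeholder Mailhandler mailbox
        "Case 1234",
        "Text of the court document goes here.",
    )
    print(msg.get_content_type())
    # Sending would go through smtplib.SMTP(...).send_message(msg) with your own server details.
```

Because the message is a single text part, there is nothing on the receiving end that needs attachment handling.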

Hotmail is necessary; QMail is not. You just need Mailhandler to be able to receive emails and turn them into nodes. That is a project unto itself that I covered here a few months ago.

The Macro Recorder is for the automation part. What I was able to do was create a “map” or “procedure” of sorts, consisting of keystrokes only (well, one mouse click, but no mouse movement), that is consistent for each doc. This “map” opens the file, from a window that is already open, in MS Word. The map has tabs, backspaces, arrows, and key combinations that are consistent every time. If you don’t know keyboard shortcuts, you’ll need to learn them. I suppose that you can use the mouse more, maybe for the whole thing even, but I have used macro recorders before and they are finicky, and I believe that they deal best with the keyboard.

Image
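The macro does the same thing to every file in turn. If you were scripting that loop instead of recording keystrokes, it might look like the sketch below. This is hypothetical: `pending_docs`, `process_batch` and the log file are names I made up, and `handle` stands in for whatever you do to each doc. The log is there so that an interrupted run can pick up where it left off.

```python
from pathlib import Path

def pending_docs(folder, done_log):
    """Return .doc files in name order that are not yet listed in the log."""
    log = Path(done_log)
    done = set(log.read_text().splitlines()) if log.exists() else set()
    return [p for p in sorted(Path(folder).glob("*.doc")) if p.name not in done]

def process_batch(folder, done_log, handle):
    """Run handle(path) on each pending doc and record it as done afterwards."""
    count = 0
    for path in pending_docs(folder, done_log):
        handle(path)
        # Append after each file so a crash loses at most the current doc.
        with open(done_log, "a") as f:
            f.write(path.name + "\n")
        count += 1
    return count

if __name__ == "__main__":
    # Placeholder folder and log names; print stands in for the real per-doc work.
    sent = process_batch("docs", "done.log", lambda p: print("would send", p.name))
    print(sent, "docs processed")
```

Processing in sorted order keeps the run predictable, which is the same reason the keystroke map has to be consistent every time.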

This process will require refinement: you’ll have to play with it. And it is slow. I am currently sending about 3 docs per minute, so the 19,000 total is going to take about two weeks. But it is a one-off, and it is going to be really valuable to have the data, so for me it is worth it. And once you get it moving, it doesn’t require much in the way of babysitting.
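To see where an estimate like that comes from, here is the arithmetic as a quick Python sketch. It assumes a steady 3 docs per minute; the hours of run time per day are not something stated above, so the last figure is just what a 14-day schedule would imply.

```python
def run_time_hours(total_docs, docs_per_minute):
    """Total machine time in hours at a steady sending rate."""
    return total_docs / docs_per_minute / 60

hours = run_time_hours(19_000, 3)
days_nonstop = hours / 24          # if the macro ran around the clock
hours_per_day_over_14 = hours / 14 # daily run time implied by a two-week schedule

print(round(hours, 1))                  # 105.6
print(round(days_nonstop, 1))           # 4.4
print(round(hours_per_day_over_14, 1))  # 7.5
```

So running nonstop would finish in under five days, while two weeks matches roughly a working day of run time per day.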

The upper image is the original; the bottom is the result. Not great, but it gets the job done!

Image
