Parse File Attachments with Apache Tika

I have been up to my neck in various Drupal search modules/configs/nightmare scenarios for almost a month now. But since Google has set the bar so high, users expect search to be easy, fast and accurate.

If you have the resources, an Apache Solr server is probably the way to go. But if you don’t have the infrastructure for that, there are still a lot of options out there. After working with the Search Files search modules and the native Drupal search, I have decided to go with the Search API module and some of its submodules, specifically the Search API Attachments module. This will allow me to parse attached documents in several different formats, including PDF, the one that I am mainly concerned with.

To parse attachments, you need some sort of helper app installed. In the case of the Search API Attachments module, that is Apache Tika. Tika is also needed for a Solr server, so this doc serves that purpose as well.

Here are some of the prerequisites:

  1. Java 1.6 – http://xmodulo.com/2012/05/how-to-install-java-16-in-linux.html
  2. Apache Maven – http://xmodulo.com/2012/05/how-to-install-maven-on-centos.html
  3. Tika – the source from which the .jar will be compiled – http://tika.apache.org/0.7/gettingstarted.html
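Before building, it can help to confirm that Java and Maven are actually on the PATH. The following is a minimal sketch, not part of the original setup; `check_tool` is a hypothetical helper name introduced here for illustration.

```shell
#!/bin/sh
# Report whether a command is available on the PATH.
# Returns 0 when found, 1 when missing.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
        return 0
    fi
    echo "missing: $1"
    return 1
}

# Check the two build prerequisites and print each version if present.
if check_tool java; then java -version; fi
if check_tool mvn; then mvn -version; fi
```

If either tool is reported as missing, go back to the matching install guide above before trying to build Tika.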

Once you have the prereqs installed, run this command from the root directory of the Tika files:

mvn install

This will run for about three minutes and will end up compiling a nice .jar file that you’ll need.
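The version number in the jar's file name depends on which Tika source you built, so it may be easier to look the file up than to type it. Here is a rough sketch, assuming the build places the jar under tika-app/target as it did on my server; `find_tika_jar` is a hypothetical helper name.

```shell
#!/bin/sh
# Print the path of the first tika-app jar under a Tika source tree.
# $1 is the root of the Tika files (defaults to the current directory).
# Returns 1 when no jar is found.
find_tika_jar() {
    for jar in "${1:-.}"/tika-app/target/tika-app-*.jar; do
        if [ -f "$jar" ]; then
            echo "$jar"
            return 0
        fi
    done
    return 1
}

# Run from the root directory of the Tika files.
if TIKA_JAR=$(find_tika_jar .); then
    echo "Tika jar: $TIKA_JAR"
else
    echo "No tika-app jar found; has the build finished?"
fi
```

If more than one jar matches the pattern, this only reports the first one, so check the directory by hand in that case.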

For more on installing Tika specifically for Drupal, read this great article: http://www.acquia.com/fr/blog/use-apache-solr-search-files

Once the compile is complete, you can test the parser by running this command:

java -jar ./tika-app/target/tika-app-1.6.jar -t /var/www/html/sites/all/pdfs/test.pdf

**NOTE: THIS PATH REFERENCES THE LAYOUT OF MY SERVER. ADJUST IT TO MATCH YOURS.**

If all the text from the PDF goes scrolling past your screen, you have everything installed correctly from the Tika/OS standpoint.
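If you would rather not eyeball the scrolling output, the same check can be scripted. This is only a sketch under the assumption that a successful parse prints at least some non-whitespace text; `has_text` is a hypothetical helper name, and the two paths are the ones from my server.

```shell
#!/bin/sh
# Succeed (exit 0) when stdin contains at least one non-whitespace character.
has_text() {
    grep -q '[^[:space:]]'
}

# Paths from my server layout; adjust both to match yours.
TIKA_JAR=./tika-app/target/tika-app-1.6.jar
PDF=/var/www/html/sites/all/pdfs/test.pdf

# Only attempt the parse when both files exist.
if [ -f "$TIKA_JAR" ] && [ -f "$PDF" ]; then
    if java -jar "$TIKA_JAR" -t "$PDF" | has_text; then
        echo "Tika extracted text from $PDF"
    else
        echo "Tika ran but produced no text for $PDF"
    fi
fi
```

An empty result does not always mean a broken install. A scanned PDF with no text layer could also come back empty, so try a PDF you know contains real text.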

Now, configure Drupal to use the Tika install and you’ll be rolling.
