Apache Solr, indexed attachments and search through Drupal Views 3.x

I now have my Apache Solr server set up, and it appears that all the tales you hear in the Drupalverse about Solr being the way to go are true. But there is quite a bit of work that goes into a Solr setup unless you decide to take an outsourced route (Acquia, for example).

Search for Drupal may be the least documented feature of the whole CMS, which is kinda weird considering how important search is overall.

I plan on putting together something fairly comprehensive for the setup and implementation of Solr for Drupal, including the indexing of attachments (15k PDFs, in my case) and a search UI built with Drupal Views 3.x. But for now, here are a few of the high-level items for Solr running on CentOS 6.4 Linux.

  • Apache Tomcat
  • Java
  • Tika (to parse documents)
  • Search API
        Solr search (for Search API) 7.x-1.2

    or

  • Apache Solr framework 7.x-1.4
        Apache Solr search 7.x-1.4
  • Attachments
  • Facets
  • Highlighting
  • Drupal Views with exposed operators for flexible searches

To be honest, it’s kind of a mess, and there are many other options beyond those listed here. And the documentation is not as thorough as it might be.

Additional Drupal Search Resources

I am really surprised at the dearth of documentation on Search options/modules on Drupal.org. So I am compiling a running list of different sites that have actual details on what you can do with Search and how best to achieve them.

https://drupal.org/node/343467 – Probably the best place for docs on Solr and Drupal, and all the mods that go with them.

http://envisioninteractive.com/drupal/drupal-7-views-with-faceted-filters-without-apachesolr/ – Great one for those who want to use the out-of-the-box search_db backend.

http://www.acquia.com/blog/simple-guide-install-apache-solr-3x-drupal-7 – One for Solr – which I believe I am going to end up using.

http://www.lullabot.com/blog/article/installing-solr-use-drupal – Installing Solr. GREAT TUTORIAL from LULLABOT

http://xmodulo.com/2013/02/how-to-install-apache-tomcat-on-centos.html – Installing Apache Tomcat for Solr on CentOS

http://www.mkyong.com/tomcat/how-to-check-tomcat-version-installed/ – Tomcat version

http://quark.humbug.org.au/publications/notes/bofh/msg00027.html – Tomcat authentication

http://wiki.apache.org/solr/SolrTomcat – troubleshooting Solr

http://zugec.com/73-how-setup-search-api-apache-solr – Installing Solr, plus Drupal info

Parse File Attachments with Apache Tika

I have been up to my neck in various Drupal search modules/configs/nightmare scenarios for almost a month now. But since Google has set the bar as high as they have, search must be easy, fast and accurate.

If you have the resources, an Apache Solr server is probably the way to go. But if you don’t have the infrastructure for that, there are still a lot of options out there. After working with the Search Files search mods and the native Drupal search, I have decided to go with the Search API module and some of its submods, specifically the Search API Attachments mod. This will allow me to parse attached documents in several different formats, including PDF, the one that I am mainly concerned with.

To parse attachments, you have to have some sort of helper app installed. In the case of the Search API Attachments module, that would be Apache Tika. Tika is also needed for a Solr server, so this doc serves that end as well.

Here are some of the prerequisites:

  1. Java 1.6 – http://xmodulo.com/2012/05/how-to-install-java-16-in-linux.html
  2. Apache Maven – http://xmodulo.com/2012/05/how-to-install-maven-on-centos.html
  3. Tika – the source from which the .jar will be compiled – http://tika.apache.org/0.7/gettingstarted.html
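Before kicking off the build, it can save some head-scratching to confirm that the tools are actually on the PATH. Here is a minimal sketch of such a check; the function name missing_tools is made up for this example, and it only tells you whether a command exists, not whether it is the right version.

```shell
#!/bin/sh
# Hypothetical helper (not part of any package): print the name of every
# command in the argument list that cannot be found on the PATH.
missing_tools() {
    for tool in "$@"; do
        # "command -v" is the POSIX way to ask whether a command resolves.
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Example: prints nothing when both Java and Maven are installed.
#   missing_tools java mvn
# To see which versions you have, run "java -version" and "mvn -version".
```

If the function prints nothing, both prerequisites were found and you can move on to compiling Tika.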

Once you have the prereqs installed, you can run:

mvn install

from the root directory of the Tika files. This will run for about three minutes and will end up compiling a nice .jar file that you’ll need.

http://www.acquia.com/fr/blog/use-apache-solr-search-files – read this great article on how to install Tika specifically for Drupal.

Once the compile is complete, you can test the parser by running this command:

java -jar ./tika-app/target/tika-app-1.6.jar -t /var/www/html/sites/all/pdfs/test.pdf

**NOTE THIS PATH REFERENCES THE LAYOUT OF MY SERVER**

If all the text from the PDF goes scrolling past your screen, you have everything installed correctly from the Tika/OS standpoint.
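If you would rather not eyeball scrolling text, the same test can be turned into a number. The sketch below counts the words the parser emits for a single file. The names TIKA_JAR, EXTRACT_CMD and extracted_words are assumptions made for this example; the default jar path is just the one from the command above, so adjust it to your own layout.

```shell
#!/bin/sh
# Path to the compiled Tika jar; defaults to the path used in the command above.
TIKA_JAR="${TIKA_JAR:-./tika-app/target/tika-app-1.6.jar}"

# Command used to extract text. "-t" is Tika's plain-text output option.
# Overridable so the function can be tried without Java installed.
EXTRACT_CMD="${EXTRACT_CMD:-java -jar $TIKA_JAR -t}"

# extracted_words FILE
# Prints how many words the extractor produced for FILE.
extracted_words() {
    # $EXTRACT_CMD is intentionally unquoted so it splits into command + args.
    $EXTRACT_CMD "$1" 2>/dev/null | wc -w | tr -d '[:space:]'
}

# Example (path from my server, yours will differ):
#   extracted_words /var/www/html/sites/all/pdfs/test.pdf
```

A result of 0 would suggest that either the Tika install or the PDF itself (a scanned image with no text layer, for instance) needs a closer look.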

Now, configure Drupal to use the Tika install and you’ll be rolling.

Drupal Site indexing – MySQL Errors, CRON timeouts

Since I recently dumped almost 17,000 new nodes into my DB in a relatively short time, I have been keeping a close eye on how the back end is responding. My main concern is the indexing process for the new data. I began receiving errors from MySQL, as seen in the screenshot.

[Image]

In the Drupal Admin UI, check the number of nodes that are indexed per CRON run. In my case, I had it set to 500, the maximum. This was a bit of overkill and I ended up making the value lower. I also increased the PHP memory, which you can check under “status reports”.

…PHP 5.3.27 (more information)
OK
PHP extensions Enabled
OK
PHP memory limit 512M
OK
PHP register globals Disabled…

In the end, I simply had to tweak the settings. I ended up at 100 nodes per run, one run per hour, and the PHP memory allocation seen above. From the Search Options UI, you can see the status of the indexing: how fast it is going, how many nodes are done, etc.
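The trade-off with a smaller batch is time. As a rough back-of-the-envelope sketch (the function name hours_to_index is made up for this example, and it assumes every CRON run indexes a full batch and no new content arrives in the meantime), the arithmetic looks like this:

```shell
#!/bin/sh
# hours_to_index NODES PER_RUN RUNS_PER_HOUR
# Estimates how many hours it takes to index NODES items.
# Rounds up, because a partial last batch still needs a whole CRON run.
hours_to_index() {
    nodes=$1
    per_run=$2
    runs_per_hour=$3
    per_hour=$((per_run * runs_per_hour))
    echo $(( (nodes + per_hour - 1) / per_hour ))
}

# Examples:
#   hours_to_index 17000 100 1   # 100 nodes per run, hourly
#   hours_to_index 17000 500 1   # 500 nodes per run, hourly
```

With roughly 17,000 nodes, 100 per hourly run works out to about 170 hours, or around a week, versus about 34 hours at 500 per run. For me the slower pace was worth it to stop the MySQL errors and timeouts.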

Getting Aggregated RSS Items into Nodes

Well, I am throwing in the towel on this. I have made it work, but it just doesn’t work very well. Too bad, because as nodes, the stories could be indexed and searched the same way that the rest of the site can. But all I can get from the current setup is a title and a description. I am going to leave the RSS items as DB items and use Views to search through them.

Search Module – Options and Plugins

Google proved it: search is key. The search that comes with Drupal is OK, but not great. I have some search features that I am using by exposing different Views’ filters to the site users, but that really isn’t good either. Marissa Mayer showed the world that simplicity is best (she is credited with Google’s nearly blank search page), and that lesson isn’t lost here. I need one way to search everything on my site, and that’s it. That is my quest for today and the last one that I will undertake before I start in on the actual design of the site with CSS and all that fun stuff. From the standpoint of function, this is the last thing to be done.

This is a good page to start with in regards to search:

http://drupal.org/node/228411

What is cool about the core search feature is the ability to extend the functionality via the checkboxes for additional search mods.

What I really need to be able to search is the Aggregator items that I store. They are DB objects and not nodes, which is what makes them tricky.