Apache Solr, indexed attachments and search through Drupal Views 3.x

I now have my Apache Solr server set up, and it appears that all the tales you hear in the Drupalverse about Solr being the way to go are true. But there is quite a bit of work that goes into a Solr setup unless you decide to take an outsourced route – Acquia, for example.

Search for Drupal may be the least documented feature of the whole CMS, which is kinda weird considering how important search is overall.

I plan on putting together something fairly comprehensive for the setup and implementation of Solr for Drupal, including the indexing of attachments (15k PDFs, in my case) and a search UI built with Drupal Views 3.x. But for now, here are a few of the high-level items for Solr running on CentOS 6.4 Linux.

  • Apache Tomcat
  • Java
  • Tika (to parse documents)
  • Search API
        Solr search (for Search API) 7.x-1.2

    or

  • Apache Solr framework 7.x-1.4
        Apache Solr search 7.x-1.4
  • Attachments
  • Facets
  • Highlighting
  • Drupal Views with exposed operators for flexible searches

To be honest, it’s kind of a mess, and there are many other options beyond what is listed here. And the documentation is not as thorough as it might be.
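
Once Tomcat and Solr are running, a quick way to confirm that Solr is answering is to hit its ping handler. This is only a sketch under assumptions that are not from my notes above: Tomcat listening on port 8080, the webapp deployed as "solr", and the ping handler enabled in solrconfig.xml at admin/ping. The function names are made up for illustration.

```shell
# Build the ping URL. Defaults (localhost, 8080, solr) are assumptions
# for a stock Tomcat deployment; pass your own host, port and path.
solr_ping_url() {
    printf 'http://%s:%s/%s/admin/ping\n' "${1:-localhost}" "${2:-8080}" "${3:-solr}"
}

# Succeeds only if the ping URL answers with a non-error HTTP status.
# curl -f turns HTTP errors (400 and up) into a non-zero exit code.
solr_is_up() {
    curl -fsS -o /dev/null "$(solr_ping_url "$@")"
}
```

Something like solr_is_up && echo "Solr is up" should then tell you whether the server is reachable before you start debugging the Drupal side.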

Additional Drupal Search Resources

I am really surprised at the dearth of documentation on Search options/modules on Drupal.org. So I am compiling a running list of different sites that have actual details on what you can do with Search and how best to achieve them.

https://drupal.org/node/343467 – Probably the best place for docs on Solr and Drupal, and all the mods that go with them.

http://envisioninteractive.com/drupal/drupal-7-views-with-faceted-filters-without-apachesolr/ – Great one for those who want to use the out-of-the-box search_db backend.

http://www.acquia.com/blog/simple-guide-install-apache-solr-3x-drupal-7 – One for Solr – which I believe I am going to end up using.

http://www.lullabot.com/blog/article/installing-solr-use-drupal – Installing Solr. GREAT TUTORIAL from LULLABOT

http://xmodulo.com/2013/02/how-to-install-apache-tomcat-on-centos.html – Installing Apache Tomcat for Solr on CentOS

http://www.mkyong.com/tomcat/how-to-check-tomcat-version-installed/ – Tomcat version

http://quark.humbug.org.au/publications/notes/bofh/msg00027.html – tomcat authentication

http://wiki.apache.org/solr/SolrTomcat – troubleshooting Solr

http://zugec.com/73-how-setup-search-api-apache-solr – installing solr and drupal info

Parse File Attachments with Apache Tika

I have been up to my neck in various Drupal search modules/configs/nightmare scenarios for almost a month now. But since Google has set the bar as high as they have, search must be easy, fast and accurate.

If you have the resources, an Apache Solr server is probably the way to go. But if you don’t have the infrastructure for that, there are still a lot of options out there. After working with the Search Files search mods and the native Drupal search, I have decided to go with the Search API module and some of its submods – specifically, the Search API Attachments mod. This will allow me to parse attached documents in several different formats, including PDF, the one that I am mainly concerned with.

To parse attachments, you need some sort of helper app installed. In the case of the Search API Attachments module, that would be Apache Tika. Tika is also needed for a Solr server, so this doc serves that end as well.

Here are some of the prerequisites:

  1. Java 1.6 – http://xmodulo.com/2012/05/how-to-install-java-16-in-linux.html
  2. Apache Maven – http://xmodulo.com/2012/05/how-to-install-maven-on-centos.html
  3. Tika – the source from which the .jar will be compiled – http://tika.apache.org/0.7/gettingstarted.html
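
Before starting the build, it may help to confirm that the tools are actually on the PATH. A minimal sketch, assuming a POSIX shell; check_prereqs is a hypothetical helper name, not something that ships with Java, Maven or Tika.

```shell
# Report whether each named tool can be found on the PATH.
# Prints "ok <tool>" or "missing <tool>" per argument and returns
# non-zero if at least one tool is missing.
check_prereqs() {
    rc=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "ok $tool"
        else
            echo "missing $tool"
            rc=1
        fi
    done
    return "$rc"
}
```

Running check_prereqs java mvn covers the first two items in the list; the Tika source itself is just a directory you download, so there is nothing to look up on the PATH for it.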

Once you have the prereqs installed, run this from the root directory of the Tika source files:

mvn install

This will run for about three minutes and compile the .jar file that you’ll need.

http://www.acquia.com/fr/blog/use-apache-solr-search-files – read this great article on how to install Tika specifically for Drupal.

Once the compile is complete, you can test the parser by running this command:

java -jar ./tika-app/target/tika-app-1.6.jar -t /var/www/html/sites/all/pdfs/test.pdf

**NOTE: THIS PATH REFERENCES THE LAYOUT OF MY SERVER. THE JAR NAME ALSO DEPENDS ON THE TIKA VERSION YOU BUILT.**

If all the text from the PDF goes scrolling past your screen, you have everything installed correctly from the Tika/OS standpoint.
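
With thousands of PDFs, checking them one at a time is not practical. The single-file test above can be wrapped in a loop that flags files for which Tika returns no text at all (scanned, image-only PDFs are a likely cause). This is a rough sketch, not a tested production script: the jar path is the one from my server, and extract_text and report_empty_pdfs are names I made up for illustration.

```shell
# Assumed location of the compiled jar; override TIKA_JAR for your build.
TIKA_JAR="${TIKA_JAR:-./tika-app/target/tika-app-1.6.jar}"

# Extract plain text from one file using the same -t option as above.
extract_text() {
    java -jar "$TIKA_JAR" -t "$1"
}

# Walk a directory and print the PDFs that yield only whitespace or
# nothing. File names containing newlines are not handled.
report_empty_pdfs() {
    find "$1" -type f -name '*.pdf' | while IFS= read -r pdf; do
        if [ -z "$(extract_text "$pdf" 2>/dev/null | tr -d '[:space:]')" ]; then
            echo "no text: $pdf"
        fi
    done
}
```

Calling report_empty_pdfs on the directory that holds the attachments starts one JVM per file, so expect it to be slow on a large collection.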

Now, configure Drupal to use the Tika install and you’ll be rolling.

Views – searching for data

So, I have all these emails from the court system, and I want to be able to search through them pretty easily. I simply expose the filter criteria for the body, and the operator as well. But what I need to do is search through everything on the whole site…..

SEO Resources

Since this blog is about Drupal and not SEO per se, I am finishing this subject up. I have a few web resources listed below that you might want to check out. But really, just google SEO Best Practices and start reading. Learn the subject then come up with a plan that suits your site.

  • What is your competition? What do they do? If you have direct competition that does a better job than you, reverse engineer their methods and implement them more effectively.
  • Get the Google Webmaster Tools and Bing Tools configured. They are free. Remember that Yahoo! search is really Bing.
  • Read. Learn SEO.
  • Develop a plan that is best for your site. If you are aiming at a niche market like me, you don’t need to compete; you need to drive awareness. And you need to satisfy a need. So make sure your content addresses what they want.
  • Develop a marketing plan that has nothing to do with the web in addition to all this other stuff.

Beginners SEO

Search Engine Optimization for WordPress « WordPress Codex

On-Page SEO Best Practices SEOmoz

Google Webmaster/Analytics, Bing Tools, SEO, Drupal and WordPress

I’ve been working with the Google Analytics and Webmaster Tools. These are really helpful and I recommend that you take a look at them for your site. I’m doing the “dev” work for this site: setting up the tags and such. I believe that it will actually be easier to do this in Drupal than WordPress, since wordpress.com doesn’t allow you to upload JavaScript files.

So I have the Webmaster Tools configured and they are working. That isn’t too bad: just get the account and the meta tag, put the tag into the WP widget, then verify. I’ve also tied the Analytics page to the Webmaster Tools. I have a message from Google that says it is tied in, but I have yet to see any actual data. Will look more later.

I also have the Bing Webmaster Tools configured for this site. I just checked that and I must have something misconfigured, although the site map that I submitted is showing up. Not sure. Will look this afternoon.

I don’t believe that you really need more than those two engines. If anyone has an opinion on this, please share it because I’m probably missing something.
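
When verification seems stuck, one thing worth checking is whether the meta tag really made it into the page the crawler sees. Here is a small sketch that pulls the tag's value out of a page; it assumes meta tag verification is being used, where Google's tag is named google-site-verification and Bing's is msvalidate.01. meta_content is a hypothetical helper, and the grep -o option it relies on is available in GNU and BSD grep but is not strictly POSIX.

```shell
# Read HTML on stdin and print the content attribute of the first
# <meta> tag whose name attribute matches the given name.
# Prints nothing if the tag is not present.
meta_content() {
    grep -io "<meta[^>]*name=[\"']$1[\"'][^>]*>" |
        sed -n "s/.*content=[\"']\([^\"']*\)[\"'].*/\1/p" |
        head -n 1
}
```

Piping a page into it, for example curl -s https://example.com/ | meta_content google-site-verification with your own URL in place of example.com, should print the verification code if the tag is in place.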