Views Bulk Operations and Taxonomies – school of hard knocks

Views Bulk Operations is a great tool for making batched changes to large amounts of data. I have been using it for the last couple of days to add taxonomy terms to the around 17,000 new nodes that I added recently. And I just learned something the hard way.

Take care if you make bulk changes to taxonomies. It is not too difficult to inadvertently add the same term to a specific vocabulary over and over this way. Vocabs are meant to be updated dynamically; as the user add new terms to content, they are added to the vocab. On a individual basis, the autocomplete or dropdowns would ensure that the same term is reused if supplied instead of adding a new term for the SAME TERM. Which is what I did by mistake yesterday. And it was a mess.

Because I didn’t approach the VBO operation properly, I added the same term over and over again, to the tune of 8,000 times. Yuck. So, when I realized what I had done, I used VBO Delete to remove the extraneous terms. But what I didn’t consider was the overall impact of such a move. When I deleted all the individual terms that way, I deleted the nodes as well. 8,000 of them. Ouch.

A quick call to the BEST HOSTING COMPANY EVER – Blackmesh – and a restore was underway and the lesson was absorbed with tail between legs. I lost a whole day’s work, yes. But I gained a lot of insight as to how not to make this mistake again.

This is a fairly high level approach to what I did:

  • Created a vocabulary for the Content type
  • Created all the terms that I plan to use for this project
  • Added a term reference on the content type linked to the vocab
  • Used the autocomplete widget for the field
  • In the VBO View for this, chose the field for that content type that will hold the tax term
  • When running the VBO, I made sure the choose the predefined tax term. Since I used the autocomplete widget, i would type the first letters and wait for the complete choice to appear. This ensured that the existing term was being used
  • There was one node that needed its own term. I ran the VBO against this one node and added a new term to make sure that I was correct about new terms being added automatically. I was correct. This also confirmed that previously I had added what the system thought was a new term, every time the VBO changed a node
  • as I went through this, I checked the vocab to make sure that the number of terms was consistent with the original terms I had added.

Views Bulk Operations and mass Taxonomies changes

I have all my new nodes in place, about 17,000 of them. And now I need to be able to tag them so the searches are easier and more relevant.

One thing I did that was smart (for a change) is to append a unique descriptor to the title of each node during creation based on the original source. So all Supreme Court cases are appended with a -SC for example. This was really key for me. Without proper data prep, it will be a mess. So plan ahead.

So, using Views Bulk Operations, I am filtering the content type that I have created for this content based on these unique values. I am then able to change the blank “term reference” field on the content type to whatever I want. Something that will be easy to search for.

A word of caution. I wasn’t able to do these changes in one large job. I had to break it up into more manageable chunks, for me about a 1000 nodes at a time. I’m not 100% sure what the issue was, I was just receiving a lot of timeouts and hung queries. So be prepared to make a few concessions on time to get the op done.

Finding corrupted nodes in Drupal data imports

I recently completed a large data import. over 17,000 new nodes that added about 4 GB of text to my DB. The data source has a long history. Most of the nodes started out as paper records, going back as far as 20 years. The paper records were digitized at some point into PDFs, with searchable text and sometimes strange formatting.

In order to sanitize the records for upload into MySQL, I put the records through another conversion process as well. You can read about that in an earlier post. Suffice to say that i was not able to get everything in perfectly. I wound up with about 4% of the records in an unusable state. These records I’ll have to deal with separately. But how did I find them? A nice mysqldump command. Here it is:

mysqldump –skip-extended-insert drupal field_revision_body | pv -s $((4345298944*100/96)) | awk -F, ‘/.Courier New\\.\\.>. <o:p><\/o:p>/ {print $4,$5}’ | tee -a frb.out

What this nice command does in essence, is find all the records in field_revision_body in the Drupal DB that have one character per line.  That is a good indication of a malformed record. It isn’t perfect but it really went a long way to show me the records that need to be worked with manually. It ends up finding all the records that have issues along with some that are actually somewhat acceptable. believe me, with 17,000 records, finding the few hundred that have problems without having to look at individual records was very nice.

Thanks to the wonderful guys over at hosting, the BEST hosting company in the world!

Data Migration

Since I have been working extensively with syncing three separate but identical environments (Production, Quality and Sandbox) I’ve also been doing a lot of work with data migrations. migrating-to-drupal-7

Migrate data

There are different methods and tools that work for different situations and it’s been difficult to get a handle on what will work best for a particular job.

I’ve been able to export tables straight out of phpmyadmin and import them to the new sites in a few cases. But what about larger migrations? One of my big concerns with Drupal is the large differences between major versions. How do you go from 6 to 7? What if I want to take this WordPress blog and put it in Drupal?

I’ve read many different books on Drupal and found that this one looks promising for migrating content. It’s from Packtpub and contains many useful suggestions about available tools and how to use them. Check this one out if you need a road map to migrate a lot of data.