Finding corrupted nodes in Drupal data imports

I recently completed a large data import. over 17,000 new nodes that added about 4 GB of text to my DB. The data source has a long history. Most of the nodes started out as paper records, going back as far as 20 years. The paper records were digitized at some point into PDFs, with searchable text and sometimes strange formatting.

In order to sanitize the records for upload into MySQL, I put the records through another conversion process as well. You can read about that in an earlier post. Suffice to say that i was not able to get everything in perfectly. I wound up with about 4% of the records in an unusable state. These records I’ll have to deal with separately. But how did I find them? A nice mysqldump command. Here it is:

mysqldump –skip-extended-insert drupal field_revision_body | pv -s $((4345298944*100/96)) | awk -F, ‘/.Courier New\\.\\.>. <o:p><\/o:p>/ {print $4,$5}’ | tee -a frb.out

What this nice command does in essence, is find all the records in field_revision_body in the Drupal DB that have one character per line.  That is a good indication of a malformed record. It isn’t perfect but it really went a long way to show me the records that need to be worked with manually. It ends up finding all the records that have issues along with some that are actually somewhat acceptable. believe me, with 17,000 records, finding the few hundred that have problems without having to look at individual records was very nice.

Thanks to the wonderful guys over at hosting, the BEST hosting company in the world!


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: