I recently completed a large data import: over 17,000 new nodes that added about 4 GB of text to my database. The data source has a long history. Most of the nodes started out as paper records, some going back as far as 20 years. The paper records were digitized at some point into PDFs, with searchable text and sometimes strange formatting.
To sanitize the records for upload into MySQL, I put them through another conversion process as well; you can read about that in an earlier post. Suffice it to say that I was not able to get everything in perfectly. I wound up with about 4% of the records in an unusable state, and those I'll have to deal with separately. But how did I find them? A nice mysqldump command. Here it is:
mysqldump --skip-extended-insert drupal field_revision_body | pv -s $((4345298944*100/96)) | awk -F, '/.Courier New\\.\\.>. <o:p><\/o:p>/ {print $4,$5}' | tee -a frb.out
In essence, this command finds all the records in field_revision_body in the Drupal database that have one character per line, which is a good indication of a malformed record. It isn't perfect, but it really went a long way toward showing me the records that need to be fixed manually: it finds all the records with issues, along with some that are actually somewhat acceptable. Believe me, with 17,000 records, finding the few hundred problem records without having to look at them individually was very nice.
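To see how the awk stage picks out record identifiers, here is a small sketch run against a made-up line shaped like `mysqldump --skip-extended-insert` output (one INSERT statement per row). The sample row and the simplified `/Courier New/` pattern are illustrative only, not my real data or the exact regex above. With `-F,`, the fourth and fifth comma-separated fields of such a row happen to land on field_revision_body's entity_id and revision_id columns, which is what makes `print $4,$5` useful for tracking down the bad nodes.

```shell
# A fabricated single-row INSERT resembling what mysqldump
# --skip-extended-insert emits for field_revision_body.
# Column order: entity_type, bundle, deleted, entity_id, revision_id, ...
line="INSERT INTO \`field_revision_body\` VALUES ('node','page',0,123,456,'und',0,'<span style=\"Courier New\">x</span> <o:p></o:p>','','full_html');"

# Match rows containing the telltale formatting, then print the
# 4th and 5th comma-separated fields (entity_id and revision_id here).
echo "$line" | awk -F, '/Courier New/ {print $4,$5}'
# prints: 123 456
```

Splitting on commas is a rough heuristic (a body value containing commas would shift the fields), but for flagging a few hundred suspect rows out of 17,000 it is good enough.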
Thanks to the wonderful guys over at Blackmesh.com hosting, the BEST hosting company in the world!