Finding corrupted nodes in Drupal data imports

I recently completed a large data import: over 17,000 new nodes that added about 4 GB of text to my DB. The data source has a long history. Most of the nodes started out as paper records, some going back as far as 20 years. The paper records were digitized at some point into PDFs, with searchable text and sometimes strange formatting.

To sanitize the records for upload into MySQL, I put them through another conversion process as well. You can read about that in an earlier post. Suffice it to say that I was not able to get everything in perfectly. I wound up with about 4% of the records in an unusable state. These records I’ll have to deal with separately. But how did I find them? A nice mysqldump command. Here it is:

mysqldump --skip-extended-insert drupal field_revision_body | pv -s $((4345298944*100/96)) | awk -F, '/.Courier New\\.\\.>. <o:p><\/o:p>/ {print $4,$5}' | tee -a frb.out

What this command does, in essence, is find all the records in field_revision_body in the Drupal DB that have one character per line. That is a good indication of a malformed record. It isn’t perfect, but it went a long way toward showing me the records that need to be worked on manually. It ends up finding all the records that have issues, along with some that are actually somewhat acceptable. Believe me, with 17,000 records, finding the few hundred that have problems without having to look at individual records was very nice.
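The same idea, flagging bodies that have one character per line, can also be sketched in Python. This is not the awk pattern above but a different heuristic: it treats every HTML tag as a line break and checks what share of the remaining lines is a single character. The function name, the 0.8 threshold, and the 20-line minimum are assumptions for illustration, not values from my import.

```python
import re

def looks_one_char_per_line(body, threshold=0.8, min_lines=20):
    """Return True when most non-empty lines of a body hold a single character.

    threshold and min_lines are illustrative guesses, not tuned values.
    """
    # Treat every HTML tag as a line break so markup does not count as text.
    text = re.sub(r"<[^>]+>", "\n", body)
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if len(lines) < min_lines:
        return False
    single = sum(1 for line in lines if len(line) == 1)
    return single / len(lines) >= threshold

# Synthetic examples: one body with a character per paragraph, one normal body.
broken = "\n".join("<p>%s</p>" % ch for ch in "ABCDEFGHIJKLMNOPQRSTUVWXYZ")
healthy = "\n".join(["<p>The motion is denied.</p>"] * 25)

print(looks_one_char_per_line(broken))
print(looks_one_char_per_line(healthy))
```

Like the awk version, a check of this kind will flag some records that are actually acceptable, so the output is a list to review rather than a verdict.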

Thanks to the wonderful guys over at Blackmesh.com hosting, the BEST hosting company in the world!


Drupal Site indexing – MySQL Errors, CRON timeouts

Since I recently dumped almost 17,000 new nodes into my DB in a relatively short time, I have been keeping a close eye on how the back end is responding. My main concern is the indexing process for the new data. I began receiving errors from MySQL, as seen in the screenshot.

In the Drupal Admin UI, check the number of nodes that are indexed per CRON run. In my case, I had it set to 500, the maximum. This was a bit of overkill and I ended up making the value lower. I also increased the PHP memory limit, which you can check under “status reports”.

…PHP 5.3.27 (more information)
OK
PHP extensions Enabled
OK
PHP memory limit 512M
OK
PHP register globals Disabled…

I really just had to tweak the settings. I ended up at 100 nodes per run, one run per hour, and the PHP memory allocation seen above. From the Search Options UI, you can see the status of the indexing: how fast, how many nodes, etc.
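To get a feel for what those settings mean for a backlog, here is a rough Python sketch of the arithmetic. It assumes the roughly 17,000 new nodes are the whole queue and that every CRON run indexes its full batch; the function name is made up for the example.

```python
import math

def hours_to_index(total_nodes, nodes_per_run, runs_per_hour=1.0):
    """Estimate the hours needed to index a backlog at a fixed batch size."""
    runs = math.ceil(total_nodes / nodes_per_run)
    return runs / runs_per_hour

hours = hours_to_index(17_000, 100)
print(f"{hours:.0f} hours, about {hours / 24:.1f} days")  # 170 hours, about 7.1 days
```

At the original setting of 500 nodes per run the same backlog would need 34 runs, so lowering the value trades indexing speed for fewer errors and timeouts.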

Preload Images for better CSS performance

I use a lot of CSS enhancements for things like :hover images, where the image changes slightly as your mouse moves over it. It is a great little trick to help focus the user’s attention on specific elements on the page. But if you don’t preload the images, you will have a slight (or not so slight) pause when you hover the first time, a sort of flicker. Preloading isn’t hard, but in the Drupal world there are a couple of extra things that I did.

I have a custom content type called HTML text. Any sort of HTML can be added in this type as a specific node and placed wherever I want. In my case, placement is done using Panels. So what I did was create one node that references all the images that I want to preload.

<span>
<img src="/sites/images/icons/F-chat-lg.png" alt="DOCResource.org Chat">
<img src="/sites/images/icons/F-forums-lg.png" alt="DOCResource.org forums">
<img src="/sites/images/icons/F-research-lg.png" alt="DOCResource.org research">
<img src="/sites/images/icons/F-services-lg.png" alt="DOCResource.org Services">
<img src="/sites/images/icons/F-top-lg.png" alt="DOCResource.org">
<img src="/sites/images/icons/F-Courts-lg.png" alt="DOCResource.org Research Courts">
<img src="/sites/images/buttons/more-btn-blk.png" alt="DOCResource.org More">
<img src="/sites/images/buttons/more-btn-ylw.png" alt="DOCResource.org Research">
<img src="/sites/default/files/u1/GreenFlg.png" alt="DOCResource.org Research">
<img src="/sites/default/files/u1/GreenFlg-Blk.png" alt="DOCResource.org Research">
</span>

Easy as pie. I added the class “preload-images-class” so I can manipulate the node using CSS. I added this class from within the panel that holds the node.

Then, I added the node to the front page of my site and set the class in my CSS to have a couple of different styles.

.panel-pane.pane-node.preload-images-class.no-title.block {
margin-bottom: -35em;
visibility: hidden;
}

The visibility property is the key. The images are loaded but don’t show up. They still take up space in the layout, though, so I also added the negative margin-bottom to remove the extra space.

I also added a few other images at the same time. I have a couple of buttons that change on a hover event and they are small and don’t affect the loading of the page.

A small detail for the overall look of the site but one that gives a nice spit and polish to the UX.
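If the list of images grows, the preload markup shown earlier could be generated instead of typed by hand. Below is a minimal Python sketch of that idea; the function name is made up, and the two paths in the example are simply taken from my node above. It also drops repeated paths so no image is listed twice.

```python
from html import escape

def preload_markup(images):
    """Build the <span> of <img> tags from (src, alt) pairs, skipping repeated paths."""
    seen = set()
    rows = []
    for src, alt in images:
        if src in seen:
            continue  # the same file only needs to be preloaded once
        seen.add(src)
        rows.append('<img src="%s" alt="%s">' % (escape(src, quote=True), escape(alt, quote=True)))
    return "<span>\n" + "\n".join(rows) + "\n</span>"

markup = preload_markup([
    ("/sites/images/buttons/more-btn-blk.png", "DOCResource.org More"),
    ("/sites/images/buttons/more-btn-ylw.png", "DOCResource.org Research"),
    ("/sites/images/buttons/more-btn-ylw.png", "DOCResource.org Research"),
])
print(markup)
```

The output can be pasted into the HTML text node as before; the CSS class is still added from within the panel.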

Import PDFs into Drupal

I have about 19,000 PDFs to get into Drupal. Yeah, fun. And there aren’t any easy ways to do it. There don’t appear to be any modules that support this. So, I had to get a little creative. And I have made it work.

Biggest caveat – This doesn’t bring a nice looking PDF into Drupal preserving all the wonderful PDF formatting. It really will just bring in the textual content. Which suits my needs because the text is all I really need. This isn’t a great solution and it may be the biggest one off in my career, but if you need what the PDF says, and not how it looks, this will work for you.

One of the issues here is that the PDFs have a lot of weird formatting in them. Many are actually scans of decades-old paper court documents. They wind up with all sorts of page breaks, table formatting and other oddities.

This is an overview of the process – high level.

  • MS Windows for the OS of the client doing all this
  • Convert PDFs into DOCs. Great utility – boxoft.com *
  • Use MS Word to remove faulty formatting
  • MS Outlook to email docs in the body of the email
  • Hotmail for the email transport to an IMAP mailbox on QMail **
  • Mailhandler Drupal Module to receive the doc
  • Automate the process with  Macro Recorder from http://www.jitbit.com ***
* Freeware
** can be any Mailhandler enabled mailbox
*** not free, but a great product with a generous 40 day trial

I’m using MS products for good reasons. I did try to make this work with Open Office, but it doesn’t have the features that I need.

MS Word, used with Outlook and Hotmail, allows you to easily send the doc in the body of the email, not as an attachment. I looked for an attachment handler for Mailhandler but I didn’t really see one. Once the document has been sent in the body, it essentially loses its MS Word attributes and becomes simple formatted text. So, it can be easily processed by Mailhandler.

Hotmail is necessary, QMail is not. You just need Mailhandler to be able to receive emails and turn them into nodes. That is a project unto itself that I covered here a few months ago.

The Macro Recorder is for the automation part. What I was able to do was create a “map” or “procedure” of sorts consisting of keystrokes only (well, one mouse click, but no mouse movement) that is consistent for each doc. This “map” opens the file from a window already opened into MS Word. The map has tabs, backspaces, arrows, and key combinations that are consistent every time. If you don’t know keyboard shortcuts, you’ll need to learn them. I suppose that you can use the mouse more, maybe even for the whole thing, but I have used macro recorders before and they are finicky, and I believe that they deal best with the keyboard.

This process will require refinement: you’ll have to play with it. And it is slow. I am currently sending about 3 docs per minute, so the full 19,000 are going to take about two weeks. But it is a one-off, and it is going to be really valuable to have the data, so for me it is worth it. Once you get it moving, it doesn’t require much in the way of babysitting.
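As a rough check on that estimate, here is a small Python sketch of the arithmetic. The 3 docs per minute and the 19,000 total come from this post; the number of hours the macro runs each day is not something I have stated, so the 24 and 8 below are assumptions for illustration only.

```python
def days_to_send(total_docs, docs_per_minute, hours_per_day):
    """Estimate the calendar days needed to send every doc at a steady rate."""
    minutes = total_docs / docs_per_minute
    return minutes / 60 / hours_per_day

# hours_per_day values are assumptions, not figures from the post
print(round(days_to_send(19_000, 3, 24), 1))  # running around the clock
print(round(days_to_send(19_000, 3, 8), 1))   # running about 8 hours a day
```

Running around the clock comes out to a bit over four days, while about eight hours a day lands close to two weeks.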

The upper image is the original; the bottom is the result. Not great, but it gets the job done!

Drupal and Media – Great new book

One of the biggest challenges Drupal presents is dealing with media files. Because of the open nature of the product, it usually takes a lot of time to make something as simple as an image gallery. I spent many hours designing one myself using different image handlers, a custom content type, views and HTML entities that present editing controls.

Packtpub.com has a great resource for those who are dealing with media files. I just picked it up but it looks like it is going to be a great resource for me. I specifically plan to use it for a video based help system for the website that I am currently working on. But check out the book at the link below.

http://www.packtpub.com/drupal-7-media/book?utm_source=blog&utm_medium=link&utm_campaign=bookmention

CKEditor – quirks and customizations

CKEditor is the best choice out there for WYSIWYG editing in the Drupal world. But there are some quirks that I am working through.

If you can’t get a custom text format to save to a textarea field of a custom content type, try adding a little whitespace to the default value of the field. Just a couple of carriage returns should do. I had to do this to get it to work for me.

Adding customizations to CKEditor 4.x

There are a few customizations that can be done for CKEditor from within Drupal. But some of the best ones need to be configured in the ckeditor.config.js file.

Use the link above to find many of the easter eggs in CKEditor.

Here is a good example of how to expand the editor area on your site:

*****************THIS IS FROM CKEDITOR.CONFIG.JS*****************

/*
* Append here extra CSS rules that should be applied into the editing area.
* Example:
* config.extraCss = 'body {color:#FF0000;}';
*/
config.extraCss = '';
config.height = 350;
/* A list of plugins that must not be loaded. This setting makes it possible
* to avoid loading some plugins defined in the {@link CKEDITOR.config#plugins}
* setting, without having to touch it.
*/

*****************END OF EXAMPLE********************************

The config.height line that you see makes my editor area taller.

Don’t want the strange little HTML tags in the bottom left of your editor? You can’t get rid of them from within Drupal; you get rid of them in ckeditor.config.js.

****************************************************************

/**
* **Note:** Plugin required by other plugin cannot be removed (error will be thrown).
* So e.g. if `contextmenu` is required by `tabletools`, then it can be removed
* only if `tabletools` isn't loaded.
*
* config.removePlugins = 'elementspath,save,font';*/
config.removePlugins = 'resize,elementspath';

*****************END OF EXAMPLE********************************

The config.removePlugins line rids you of TWO things: the resize grippie in the bottom right and the HTML elements on the bottom left. There are many other things that you can specify here. See the extensive CKEditor docs for more.

The “More” block link – make it look professional with CSS

I use a lot of views to display content. In the pager section of your view you have an option to include a “more” link. I like the concept, but I really want more flexibility in styling and such. So, what I ended up creating is a “more” link using a little custom HTML and some CSS. First, I used the Header section of the View to add the link with simple HTML. I also added a class selector on the link (for the images and placement) and a <span> tag around the text that will be displayed for the link itself. This allows me to style the text of the link separately.


Then, I created some CSS for the new class more-button-added. The rules are as shown:

.more-button-added {
background: url("/sites/images/buttons/more-btn-blk.png") no-repeat;
position: relative;
float: right;
margin: -34px -19px;
}
.more-button-added:hover {
position: relative;
background: url("/sites/images/buttons/more-btn-ylw.png") no-repeat;
float: right;
margin: -34px -19px;
text-decoration: none;
}

These rules allow the image to be inserted and moved to the appropriate spot right along with the text for the link. Now, I don’t want a More text link in addition to the image, so I use the class I added to the <span>, .more-link-clear, to style the text this way:

.more-link-clear {
color:transparent;
text-decoration:none;
}
.more-link-clear:hover{
opacity:0;
}

Now the link is there, it is directly behind my image, and it can’t be seen! And since it is a class and not an ID, I can use it for many of the other “more” buttons that I will place. I ended up using this technique on five different Panels panes on the front page of my site and they work really well. All of the panes are views, so I simply have to change the HTML link target to make it custom for that view. The CSS remains the same. Very nice!
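Since only the link target changes from view to view, the header HTML could be produced by a small helper. The Python sketch below is a guess at the markup based on the description above (the two class names come from the CSS in this post); the function name and the example paths are hypothetical.

```python
from html import escape

def more_link(target, label="More"):
    """Render a "more" link: the class on the <a> carries the image, the <span> hides the text."""
    return (
        '<a class="more-button-added" href="%s">' % escape(target, quote=True)
        + '<span class="more-link-clear">%s</span></a>' % escape(label)
    )

# Hypothetical targets, one per view
for path in ["/forums", "/research"]:
    print(more_link(path))
```

Each view’s Header section would get one of these lines, and the CSS shown above stays untouched.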