circusmachina: Implementing AutoCorrect in JavaScript

Although I like PmWiki and BlogIt for their simplicity, I sometimes miss a feature or two from the content-management software I developed. I prefer using Markdown to Wiki syntax when writing posts, and fortunately there's a recipe for that. There is no recipe (as of this writing) for the second feature I miss: auto-correct. The software I wrote would replace various combinations of characters with nicer-looking replacements -- such as (R) with ®, (C) with ©, straight quotes with curly quotes, etc. Maybe I've just been spoiled by the implementation of the feature in LibreOffice, but having to look at straight quotes and uncorrected copyright symbols seems a little jarring. I mean, if the technology exists to make your text beautiful, why not use it wherever it can be used?

It has occurred to me that I could simply rewrite my original software to be more like PmWiki: to use the file system instead of a database while retaining the features that I want. But, since my free time is limited, my primary focus remains on writing a computer language and then a game engine. It was quicker to write a post-processing script that would run after each page on the site loads, subtly tweaking the items that needed to be adjusted until the page looked the way I wanted it to look.

Why a Script?

There were a couple of considerations that pointed toward a client-side solution rather than a server-side one. The first was that, in order to implement auto-correct, I would need access to the text of each page as it was rendered. PmWiki has a function that can be used to insert a routine that will process code contained in custom wiki markup, and you can specify when that routine is executed in the rendering pipeline. The trouble is that, after trying it, I discovered that at no point does a custom routine seem to have access to just the rendered text of the page -- not even with the _end option. My routine ended up correcting the URLs inside generated links, which invalidated them. I could have written a number of regular expressions to filter out <a> tags, as well as other tags that I wanted to avoid, but that would have needlessly complicated the code.

I next attempted to alter the Markdown recipe so that the text contained within Markdown-formatted posts would have the desired corrections applied to it. This ran into similar problems: the addresses of some links were being reformatted, resulting in invalid links. Furthermore, only the text of the posts was affected; the titles and links on the sidebar were exempt from this processing. I needed a way to single out just the text of the rendered page -- all of the text.

The DOM provides just such a way. The text associated with each HTML element is contained within a Text node, which is a child of the node that represents the HTML element. Thus, the text of a link such as:

<a href="http://www.circusmachina.com">Behold -- my glorious home page!</a>

is represented (in a somewhat simplified form) as:

Element { tagName: "A"; href: "http://www.circusmachina.com"; ... };
- Text { nodeValue: "Behold -- my glorious home page!"; ... };

Note how only the text between the <a> and </a> tags is included in the text node; the address of the link is contained within an attribute (href) of the anchor node. Changes to the contents of the text node will not invalidate the link itself; text nodes are self-contained. If it could get access to just the text nodes in the document, then an autocorrect routine could run without fear of altering something crucial.
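To make the difference concrete, here is a small sketch that uses plain strings instead of a live page. The single rule shown is a hypothetical stand-in for the real list of corrections, and the variable names are mine:

```javascript
// Hypothetical rule: turn every straight double quote into a right curly quote.
var straightQuote = /"/g;
var curlyQuote = String.fromCharCode(8221);

var html = '<a href="http://www.circusmachina.com">"Behold"</a>';

// Applied to the raw markup, the quotes that delimit the href attribute are
// rewritten as well, so the link no longer works.
var correctedMarkup = html.replace(straightQuote, curlyQuote);

// Applied only to the string a Text node would hold, the attribute is untouched.
var linkText = '"Behold"';
var correctedText = linkText.replace(straightQuote, curlyQuote);
var safeMarkup = '<a href="http://www.circusmachina.com">' + correctedText + '</a>';
```

This is the failure I kept running into on the server side: the correction has no way to tell markup from prose unless something separates the two first.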

The DOM provides a way to do that, too: a method called createTreeWalker(), which returns an iterator (a TreeWalker) that allows you to iterate through a list of nodes that is filtered based on specific criteria. One of the criteria that can be specified is NodeFilter.SHOW_TEXT which, when specified by itself, ensures the list contains only Text nodes. Since the DOM provided what I sought, JavaScript became the natural choice for implementing the autocorrect feature; and by tying the autocorrect routine to the onLoad event of the document body, I could ensure that the routine ran after the page had been rendered (and thus, after the DOM had been fully constructed).
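As a minimal sketch of the idea, the helper below walks a subtree and collects the value of every text node it finds. The name listTextValues is hypothetical and is not part of the code listed later; the document is passed in as a parameter only so that the sketch does not assume a browser global:

```javascript
// NodeFilter.SHOW_TEXT is 4 in the DOM specification; fall back to the literal
// where the NodeFilter global is unavailable (for example, outside a browser).
var SHOW_TEXT = (typeof NodeFilter !== "undefined") ? NodeFilter.SHOW_TEXT : 4;

// listTextValues is a hypothetical helper, not part of autocorrect.js.
function listTextValues(doc, root) {
  var walker = doc.createTreeWalker(root, SHOW_TEXT);
  var values = [];
  var node;

  // nextNode() returns null once the walker runs out of text nodes
  while ((node = walker.nextNode())) {
    values.push(node.nodeValue);
  }

  return values;
}

// In a browser: listTextValues(document, document.body)
```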

Using the Code

To use the code, simply include it in your header, either directly or by reference, and then call DocumentText.autoCorrect() at some point. I call the routine in response to the body.onLoad event, but you may wish to call it at a different point in time.
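One way to wire this up, sketched under the assumption that autocorrect.js has already been loaded and that the page runs in a browser. The helper name hookAutoCorrect is hypothetical; it only exists so the wiring can be shown in isolation:

```javascript
// hookAutoCorrect is a hypothetical helper. 'target' is anything that has an
// addEventListener method (window, in a browser), and 'documentText' is the
// DocumentText namespace from autocorrect.js.
function hookAutoCorrect(target, documentText) {
  target.addEventListener("load", function () {
    // No arguments: use the default language and the default ignore list
    documentText.autoCorrect();
  });
}

// Only attach the handler where both the browser and the script are present
if (typeof window !== "undefined" && typeof DocumentText !== "undefined") {
  hookAutoCorrect(window, DocumentText);
}
```

Setting onload on the body element, as I do, has the same effect; the listener form just avoids overwriting any handler that is already there.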

DocumentText.autoCorrect() accepts a couple of parameters that are designed to make the code more flexible:

Specifying Languages

One thing that occurred to me is that the corrections I want, which make sense in the English language (the U.S. flavor, anyway), may not be grammatically correct or even desired in other languages. Thus, it is possible to specify a language code as part of the Corrections namespace; the same code can be used when calling DocumentText.autoCorrect() to ensure that only the corrections for that language are applied. I fill this parameter with the value of navigator.language, but you can also hard-code a value if desired. If the language specified is not found in the Corrections namespace, then the routine will fall back on the corrections for the language named by Corrections.defaultLanguage.
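The lookup and its fallback can be sketched as follows. The table here is a stand-in for the Corrections namespace, the "de-DE" entry is purely hypothetical, and pickCorrections is a helper name I made up for the illustration:

```javascript
// A stand-in for the Corrections namespace; the "de-DE" entry is hypothetical.
var table = { defaultLanguage: "en-US" };
table["en-US"] = [[/"/g, String.fromCharCode(8221)]];
table["de-DE"] = [[/\s"/g, " " + String.fromCharCode(8222)]];

// pickCorrections is a hypothetical helper that mirrors the fallback rule:
// use the list for the requested language, or else the default language.
function pickCorrections(corrections, languageCode) {
  var list = languageCode ? corrections[languageCode] : null;

  if (!list || list.length === 0) {
    list = corrections[corrections.defaultLanguage];
  }

  return list || [];
}

var german = pickCorrections(table, "de-DE");   // the de-DE list
var british = pickCorrections(table, "en-GB");  // no exact match: the en-US list
var unnamed = pickCorrections(table, "");       // no code given: the en-US list
```

Note that the match is exact. A browser that reports "en" or "en-GB" will get the default corrections unless a list is registered under that precise code.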

Specifying Items to Ignore

Further testing on my own page showed that there are times when you don't want corrections to be applied: such as in the middle of a piece of code. If your user wants to cut and paste the code directly into her IDE of choice, then the code may not run properly if the straight quotes have been replaced with curly quotes. Therefore it is also possible to specify an array of strings that names elements which should be ignored when calling DocumentText.autoCorrect(). I use this functionality to prevent corrections from occurring inside of <code> elements, as well as inside elements to which certain CSS classes have been applied. The strings passed as part of this array can be formatted as follows:

class: name of CSS class - Use this to specify the name of a class that, if applied to an element, should cause the text of the element to be ignored. The routine checks to see whether the parent of the text node has the named class in its classList property and, if so, the text is ignored.
nodeName: name of an HTML node - Use this to specify an element, such as CODE, that should be ignored when it is encountered. The routine checks to see whether the text node is a child of an element whose tagName property matches the one specified, and ignores the text if it is.
You can also specify a name without a prefix, in which case it is taken to represent the id of an element whose text should be ignored.
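The three formats can be sketched as a pair of small functions. Both parseIgnoreItem and isIgnored are hypothetical names used only for this illustration, and the class name and id in the usage comment are made up:

```javascript
// parseIgnoreItem and isIgnored are hypothetical helpers, not part of
// autocorrect.js. A string without a prefix is treated as an element id.
function parseIgnoreItem(item) {
  var parts = item.split(":");

  if (parts.length > 1) {
    return { category: parts[0].trim().toLowerCase(), name: parts[1].trim() };
  }

  return { category: "id", name: parts[0].trim() };
}

// 'parent' is the element that directly contains the text node
function isIgnored(parent, items) {
  for (var i = 0; i < items.length; i++) {
    var item = parseIgnoreItem(items[i]);

    if (item.category === "class") {
      if (parent.classList.contains(item.name)) return true;
    } else if (item.category === "nodename") {
      if (parent.nodeName === item.name) return true;
    } else if (parent.id === item.name) {
      // Anything else is compared against the id
      return true;
    }
  }

  return false;
}

// Example list: skip <code> elements, a made-up class, and a made-up id
// isIgnored(element, ["nodeName:CODE", "class:sourceblock", "sidebar"])
```

Only the immediate parent of the text node is examined, so text inside a <b> nested within a <code> element would not be matched by "nodeName:CODE".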

The Code

You can download the code here, and it is listed below, although the version of GeSHi used by the source block recipe for PmWiki seems to have difficulty with the JavaScript regular expressions in the source.

/* autocorrect.js

   Copyright (C) 2015 Michael T. Malicoat.  All rights reserved.

   This work is licensed under the Creative Commons Attribution-ShareAlike 3.0
   Unported License. To view a copy of this license, visit
   http://creativecommons.org/licenses/by-sa/3.0/ or send a letter to
        Creative Commons
        444 Castro Street
        Suite 900
        Mountain View, California, 94041
        USA.

   This program is distributed in the hope that it will be useful,
   but WITHOUT ANY WARRANTY; without even the implied warranty of
   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
*/

/* Corrections namespace */
var Corrections = {};

/* The default language to use if an invalid language code is specified, or if
   no language code is specified.
*/
Corrections.defaultLanguage = "en-US";

/* This array contains the strings and regular expressions that are used to
   find text which needs to be autocorrected, and the replacement text to use.

   You can specify additional corrections by specifying an array similar to this
   one as a property of the Corrections namespace, using the language code as
   the name of the property.  When DocumentText.autoCorrect() is called with 
   the same language code, your corrections will be used.
*/
Corrections["en-US"] = [
  // Copyright symbol => &copy; => #169
  ["(C)", String.fromCharCode(169)],
  // Registered trademark symbol => &reg; => #174
  ["(R)", String.fromCharCode(174)],
  // Trademark symbol => &trade; => #8482
  ["(TM)", String.fromCharCode(8482)],
  // Exactly two hyphens surrounded by whitespace => <space>&mdash;<space> => #32,#8212,#32
  [/\s-{2}\s/g, String.fromCharCode(32,8212,32)],
  // Exactly two hyphens followed by a quote => &mdash;&rdquo; => #8212,#8221
  ["--\"", String.fromCharCode(8212,8221)],
  // Single dash surrounded by whitespace => <space>&ndash;<space> => #32,#8211,#32
  [/\s-\s/g, String.fromCharCode(32,8211,32)],
  // Single apostrophe prefixed by space => <space>&lsquo; => #8216
  [/\s\'/g, String.fromCharCode(32, 8216)],
  // Apostrophe at beginning of node string => &lsquo; => #8216
  [/^\'/, String.fromCharCode(8216)],
  // Any remaining apostrophes => &rsquo; => #8217
  [/\'/g, String.fromCharCode(8217)],
  // Double-quote prefixed by space => <space>&ldquo; => #32,#8220
  [/\s\"/g, String.fromCharCode(32, 8220)],
  // Double-quote followed by space => &rdquo;<space> => #8221,32
  [/\"\s/g, String.fromCharCode(8221, 32)],
  // Double-quote prefixed by newline => <newline>&ldquo; => <newline> #8220
  ["\n\"", "\n" + String.fromCharCode(8220)],
  // Double-quote followed by newline => &rdquo;<newline> => #8221 <newline>
  ["\"\n", String.fromCharCode(8221) + "\n"],
  // Double-quote at beginning of node string => &ldquo; => #8220
  [/^\"/, String.fromCharCode(8220)],
  // Double-quote preceded by opening paren => (&ldquo; => (,#8220
  ["(\"", "(" + String.fromCharCode(8220)],  
  // Double-quote at end of node string => &rdquo; => #8221
  [/\"$/, String.fromCharCode(8221)],
  // Any remaining double-quote characters => &rdquo; => #8221
  [/\"/g, String.fromCharCode(8221)],
  // Exactly three periods => &hellip; => #8230
  [/\.{3}/g, String.fromCharCode(8230)]
];

/* DocumentText namespace */
var DocumentText = {};

/* This array lists the elements which are, by default, skipped by
   DocumentText.autoCorrect().  The items named in this array are concatenated
   with any passed to that function.
*/
DocumentText.defaultItemsToIgnore = [
  /* By default, we skip <TEXTAREA> nodes, since it is usually undesirable to
     autocorrect the text contained therein.
  */
  'nodeName:TEXTAREA'
];

/* Walk all nodes in the document and return an array that consists only of
   text nodes.

   'itemsToIgnore', if specified, should be an array of strings that names
   items in the document whose text should be ignored.  The strings can be
   plain and unadorned, in which case the id of the parentNode is checked
   against the string.  The strings can also be formatted as follows:

     * class:className - The classList of the parentNode will be checked against
         'className'.

     * nodeName:nodeName - The nodeName of the parentNode will be checked 
         against 'nodeName'.

   Note that the prefixes are not case-sensitive, but the values that follow 
   them may be.  

   If a match is found (either in parentNode.id, parentNode.classList[], or
   parentNode.nodeName), then the node is skipped.  Otherwise, it is added to
   the list of nodes returned by the function.

   Node walking code adapted from:
    http://stackoverflow.com/questions/2579666/getelementsbytagname-equivalent-for-textnodes
*/
DocumentText.nodes = function(itemsToIgnore) {
  // Construct a tree walker that will return text nodes
  var walker = document.createTreeWalker(document.body, NodeFilter.SHOW_TEXT, 
    null, false);

  var node;
  var textNodes = [];
  var insertNode = true;

  var ignoreComponents = [];
  var ignoreCategory;
  var ignoreName;

  // Loop through all text nodes
  while( node = walker.nextNode() ) {          
    if ( (itemsToIgnore) && (itemsToIgnore.length > 0) ) {
      insertNode = true;

      for (var i = 0; i < itemsToIgnore.length; i++) {
        // Split the string that specifies the item to ignore
        ignoreComponents = itemsToIgnore[i].split(":");

        if ( ignoreComponents.length > 1 ) {
          // The category to ignore should be the first element
          ignoreCategory = ignoreComponents[0].trim().toLowerCase();
          // and the named item in the second element
          ignoreName = ignoreComponents[1].trim();
        }

        // Otherwise, we just grab the first component and assume it's an id.
        // Clear the category so a prefix from an earlier item is not reused.
        else {
          ignoreCategory = "";
          ignoreName = ignoreComponents[0].trim();
        }

        switch(ignoreCategory) {
          case "class":
            if ( node.parentNode.classList.contains(ignoreName) )
              insertNode = false;

            break;

          case "nodename":
            if ( node.parentNode.nodeName == ignoreName )
              insertNode = false;

            break;

          default:
            if ( node.parentNode.id == ignoreName )
              insertNode = false;
        }
      }
    }

    if ( insertNode )
      textNodes.push(node);
  }

  return textNodes;
}

/* Correct the text in every text node found in the document.

   'languageCode' should specify the language code of a language for which
   corrections exist; for example, 'en-US'.  If invalid or not specified, this
   function will use the value of Corrections.defaultLanguage.

   'itemsToIgnore', if specified, should be an array of strings which name
   items to skip.  This array is passed to DocumentText.nodes().  See the
   comments there for information on how the strings should be formatted.

   In addition, this function will skip nodes that match the items in
   'DocumentText.defaultItemsToIgnore'.
*/
DocumentText.autoCorrect = function(languageCode, itemsToIgnore) {
  // Ensure we have a language code
  if ( (!languageCode) || (languageCode == "") )
    languageCode = Corrections.defaultLanguage;

  // Get the corrections to use
  var corrections = Corrections[languageCode];
  // If we couldn't find corrections for the specified language, try the default
  if ( ( !corrections ) || ( corrections.length == 0) )
    corrections = Corrections[Corrections.defaultLanguage];

  // Couldn't find corrections for the default language, so quit
  if ( ( !corrections ) || ( corrections.length == 0) )
    return;    

  // Ensure 'itemsToIgnore' is an array, if an empty one
  if ( !itemsToIgnore )
    itemsToIgnore = [];

  // Get all text nodes in the document, less the ones we need to ignore
  var documentText = this.nodes(itemsToIgnore.concat(this.defaultItemsToIgnore));

  // Process all text nodes
  for (var thisNode = 0; thisNode < documentText.length; thisNode++) {
    // Process all substitutions
    for (var thisCorrection = 0; thisCorrection < corrections.length; 
      thisCorrection++) 
    {
      // Skip invalid and empty nodes
      if ( (!documentText[thisNode].nodeValue) || 
        (documentText[thisNode].nodeValue == "") )
          continue;

      // Make replacements
      /* To be a valid replacement, the item at the specified index must
         have at least two items (if more than 2, the remaining items are 
         ignored).
      */
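      if ( corrections[thisCorrection].length > 1 ) {
        var needle = corrections[thisCorrection][0];
        var replacement = corrections[thisCorrection][1];
        var text = documentText[thisNode].nodeValue;

        /* replace() with a string needle changes only the first match, so
           split and rejoin to catch every occurrence.  Regular expressions
           are applied as written.
        */
        documentText[thisNode].nodeValue = (typeof needle == "string") ?
          text.split(needle).join(replacement) :
          text.replace(needle, replacement);
      }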
    }
  }
}
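The order of the entries in a corrections array matters, because each rule sees the output of the rules before it. Here is a small sketch with three of the double-quote rules applied to a plain string; applyRules is a hypothetical helper that stands in for the inner loop of DocumentText.autoCorrect():

```javascript
var openQuote = String.fromCharCode(8220);
var closeQuote = String.fromCharCode(8221);

// Three of the en-US rules, in the same relative order as in the listing
var quoteRules = [
  [/\s"/g, " " + openQuote],  // quote after whitespace opens a quotation
  [/^"/, openQuote],          // quote at the start of the text opens one, too
  [/"/g, closeQuote]          // whatever is left must be a closing quote
];

// applyRules is a hypothetical helper: apply each rule in turn
function applyRules(text, rules) {
  for (var i = 0; i < rules.length; i++) {
    text = text.replace(rules[i][0], rules[i][1]);
  }
  return text;
}

var sample = '"Hello," she said, "world."';

// Specific rules first, catch-all last: opening and closing quotes alternate
var inOrder = applyRules(sample, quoteRules);

// Catch-all first: every quote becomes a closing quote and the other rules
// no longer find anything to match
var reversed = applyRules(sample, quoteRules.slice().reverse());
```

If you add corrections of your own, keep the catch-all patterns at the end of the group they belong to.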

by michael, on February 03, 2015, at 09:32 AM

tags: Code, Javascript