Cleaning HTML in a content editor with JavaScript using jQuery
A few days ago I saw a site that friends of mine had created. They use a free content management
system, but they had pasted content in from Word, and the resulting HTML was quite dirty.
Years ago I wrote a JavaScript routine for cleaning up pasted HTML (for
instance the content of a Word document) within a web-based content editor, purely for IE 5/6.
It worked quite well, but it was based on a mix of blacklisting and whitelisting. Now that I am using
jQuery, it seemed much easier to write such a script for multiple browsers, this time based purely
on whitelisting.
The script I created is not meant for securing content that will be
submitted (think of cross-site scripting). Because the cleaning runs in the browser, you cannot be sure that only
cleaned HTML will be submitted. It is only meant to help content editors who are unaware of these pasting issues.
I will show you how I built the script; hopefully it will be of some use
to you. I started with a function that loops through all elements
of the HTML and removes every element that is not in the whitelist, while keeping its content. It
looks like this:
    var tagsAllowed = "|h1|h2|h3|p|span|div|a|b|strong|br|hr|";

    //Extension for getting the tagName
    $.fn.tagName = function() {
        return this.get(0).tagName.toLowerCase();
    };

    function clearUnsupportedTagsAndAttributes(obj) {
        $(obj).children().each(function() {
            //recursively down the tree
            clearUnsupportedTagsAndAttributes($(this));
            var tag = $(this).tagName();
            if(tagsAllowed.indexOf("|" + tag + "|") < 0) {
                $(this).replaceWith($(this).html());
            }
        });
    }
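The pipe delimiters in tagsAllowed are what make the indexOf test safe: without them, a short tag name such as s would match as a substring of span. A minimal sketch of that lookup in isolation (the helper name isTagAllowed is mine and is not part of the original script):

```javascript
// Same whitelist format as in the script: every tag is wrapped in pipes.
var tagsAllowed = "|h1|h2|h3|p|span|div|a|b|strong|br|hr|";

// Hypothetical helper (not in the original script): true when the
// lowercased tag name appears in the whitelist as a complete entry.
function isTagAllowed(tag) {
    return tagsAllowed.indexOf("|" + tag.toLowerCase() + "|") >= 0;
}

console.log(isTagAllowed("SPAN")); // true
console.log(isTagAllowed("s"));    // false, although "s" is a substring of "span"
console.log(isTagAllowed("font")); // false
```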
Then I realized that when a script or style tag is not whitelisted, you do not want
to keep the content of the tag in the HTML either. So I changed the following part
of the function:
    ...
    if(tagsAllowed.indexOf("|" + tag + "|") < 0) {
        if(tag == "style" || tag == "script")
            $(this).remove();
        else
            $(this).replaceWith($(this).html());
    }
    ...
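To see the effect of the two branches without a browser, here is a small DOM-free sketch of the same idea. It is not the jQuery code itself: nodes are plain objects of the made-up form { tag, children }, text is a plain string, and cleanNodes is a name I chose for illustration only.

```javascript
var tagsAllowed = "|h1|h2|h3|p|span|div|a|b|strong|br|hr|";

// Returns a cleaned copy of a list of nodes. A node is either a text
// string or an object { tag: "p", children: [...] } (illustrative model).
function cleanNodes(nodes) {
    var result = [];
    for (var i = 0; i < nodes.length; i++) {
        var node = nodes[i];
        if (typeof node === "string") {
            result.push(node); // text is always kept
            continue;
        }
        var tag = node.tag.toLowerCase();
        var children = cleanNodes(node.children || []); // recurse first
        if (tagsAllowed.indexOf("|" + tag + "|") >= 0) {
            result.push({ tag: tag, children: children }); // whitelisted: keep
        } else if (tag === "script" || tag === "style") {
            // not whitelisted: drop the tag together with its content
        } else {
            result = result.concat(children); // unwrap: keep only the content
        }
    }
    return result;
}

var dirty = [
    { tag: "font", children: ["Hello ", { tag: "b", children: ["world"] }] },
    { tag: "style", children: ["p { color: red }"] }
];
console.log(JSON.stringify(cleanNodes(dirty)));
// ["Hello ",{"tag":"b","children":["world"]}]
```

The font element disappears but its text and the nested b survive, while the style element is dropped including its content.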
Now that all non-whitelisted tags are removed, I also want to remove unwanted attributes.
JavaScript can also be executed from attributes such as onmouseover or onclick, and you may not wish to include
these in the HTML. For this I added the following code:
    //Allowed attributes per tag; a tag without an entry keeps no attributes
    var attributesAllowed = {};
    attributesAllowed["span"] = "|id|class|";
    attributesAllowed["div"] = "|id|class|onclick|style|";
    attributesAllowed["a"] = "|id|class|href|name|";
I also added an else block to the clearUnsupportedTagsAndAttributes function. For IE I needed to catch
errors, because some of the attributes IE loops through are not supported:
    ...
    else {
        var attrs = $(this).get(0).attributes;
        //Loop backwards: attrs is a live list, so removing an attribute shifts the ones after it
        for(var i = attrs.length - 1; i >= 0; i--) {
            try {
                if(attributesAllowed[tag] == null ||
                    attributesAllowed[tag].indexOf("|" + attrs[i].name.toLowerCase() + "|") < 0) {
                    $(this).removeAttr(attrs[i].name);
                }
            }
            catch(e) {} //Fix for IE, catch unsupported attributes like contenteditable and dataFormatAs
        }
    }
    ...
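Because the attributes collection of an element is live, removing an entry shifts the ones after it, so a forward loop can skip the attribute that follows each removal; walking the list backwards avoids that. The sketch below shows the per-tag lookup and the backward walk on a plain array that stands in for the attribute list. The names isAttributeAllowed and filterAttributes are illustrative, and the whitelist is written as a plain object literal here.

```javascript
// Per-tag attribute whitelist, same pipe-delimited format as the tag list.
var attributesAllowed = {
    span: "|id|class|",
    div: "|id|class|onclick|style|",
    a: "|id|class|href|name|"
};

// Illustrative helper: a tag without its own entry allows no attributes.
function isAttributeAllowed(tag, name) {
    if (!Object.prototype.hasOwnProperty.call(attributesAllowed, tag)) {
        return false;
    }
    return attributesAllowed[tag].indexOf("|" + name.toLowerCase() + "|") >= 0;
}

// Removes disallowed names in place. Iterating from the end means a
// removal never shifts an entry that still has to be visited.
function filterAttributes(tag, names) {
    for (var i = names.length - 1; i >= 0; i--) {
        if (!isAttributeAllowed(tag, names[i])) {
            names.splice(i, 1);
        }
    }
    return names;
}

console.log(JSON.stringify(filterAttributes("a", ["href", "onclick", "onmouseover", "class"])));
// ["href","class"]
console.log(JSON.stringify(filterAttributes("p", ["style", "lang"])));
// []
```

With a forward loop over the same array, onmouseover would survive, because it slides into the position that was just vacated by onclick.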
Next I wanted to remove empty tags. Some tags must remain in the content even when they are empty,
for example the br tag, so I added the following and updated the else block above:
    var emptyTagsAllowed = "|br|hr|";
    ...
    else {
        if($(this).html().replace(/^\s+|\s+$/g, '') == "" && emptyTagsAllowed.indexOf("|" + tag + "|") < 0)
            $(this).remove();
        else
        {
            var attrs = $(this).get(0).attributes;
            ...
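The emptiness test can be tried on its own. In the sketch below, isEffectivelyEmpty and shouldRemoveAsEmpty are hypothetical helpers around the same regular expression. The optional treatNbspAsEmpty flag is my own addition and not something the script does: pasted content can contain paragraphs that hold only a non-breaking space entity, which the plain whitespace test does not treat as empty.

```javascript
var emptyTagsAllowed = "|br|hr|";

// Hypothetical helper: true when the inner HTML is only whitespace.
// With treatNbspAsEmpty set (an assumption, not part of the original
// script), "&nbsp;" entities count as whitespace as well.
function isEffectivelyEmpty(innerHtml, treatNbspAsEmpty) {
    var html = treatNbspAsEmpty ? innerHtml.replace(/&nbsp;/gi, " ") : innerHtml;
    return html.replace(/^\s+|\s+$/g, "") === "";
}

// Mirrors the condition in the script: empty and not in emptyTagsAllowed.
function shouldRemoveAsEmpty(tag, innerHtml, treatNbspAsEmpty) {
    return isEffectivelyEmpty(innerHtml, treatNbspAsEmpty) &&
        emptyTagsAllowed.indexOf("|" + tag + "|") < 0;
}

console.log(shouldRemoveAsEmpty("p", "  \n "));        // true
console.log(shouldRemoveAsEmpty("br", ""));            // false
console.log(shouldRemoveAsEmpty("p", "&nbsp;"));       // false
console.log(shouldRemoveAsEmpty("p", "&nbsp;", true)); // true
console.log(shouldRemoveAsEmpty("p", "text"));         // false
```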
The last hurdle was comments in the HTML. Content pasted from Word also contains comments,
and these are not removed by the code so far. I created an extension for this, which I call after cleaning
the HTML with the clearUnsupportedTagsAndAttributes function:
    //Extension for removing comments
    $.fn.removeComments = function() {
        //return the jQuery object so the call can be chained
        return this.each(
            function(i, objNode){
                var objChildNode = objNode.firstChild;
                while (objChildNode) {
                    if (objChildNode.nodeType === 8) {
                        //comment node: remember the next sibling before removing it
                        var next = objChildNode.nextSibling;
                        objNode.removeChild(objChildNode);
                        objChildNode = next;
                    }
                    else
                    {
                        if (objChildNode.nodeType === 1) {
                            //recursively down the tree
                            $(objChildNode).removeComments();
                        }
                        objChildNode = objChildNode.nextSibling;
                    }
                }
            }
        );
    };
Then, to my shock, pasting code from Visual Studio caused an error in the clearUnsupportedTagsAndAttributes
function. The HTML was not well formed, and looping through it failed. So I added a variable someError, initially false,
and put a try/catch in the clearUnsupportedTagsAndAttributes function; in the catch block I set someError to true.
For now I show an error message, but you will probably want to paste only the text when an error occurs. You can
see the full JavaScript here: CleanContent.js
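The text-only fallback could look roughly like this. It is a sketch and not the code in CleanContent.js: cleanWithFallback is a hypothetical wrapper that takes the cleaning step and the fallback as functions, so it assumes nothing about the editor itself.

```javascript
// Hypothetical wrapper: runs the cleaning step and, if it throws,
// falls back to a plain-text version instead of only reporting an error.
function cleanWithFallback(clean, getPlainText) {
    try {
        return { someError: false, content: clean() };
    } catch (e) {
        return { someError: true, content: getPlainText() };
    }
}

// Stand-ins for the real steps, only to make the sketch runnable.
function failingClean() {
    throw new Error("HTML was not well formed");
}
function plainText() {
    return "int x = 1;";
}

var result = cleanWithFallback(failingClean, plainText);
console.log(result.someError); // true
console.log(result.content);   // int x = 1;
```

In the editor, clean would call clearUnsupportedTagsAndAttributes on the editable element and return its HTML, and getPlainText could return the element's text through jQuery's .text().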
A working example
You can try it in the example below, which uses a simple contenteditable div.
Click the Clean HTML button to clean up the HTML in the content editor. I have already
inserted some 'unclean' HTML, but you can also try pasting from Word.
[Interactive example: a rich text editor pre-filled with sample content (a supported heading, a non-supported heading, red, bold and italic text, and sample MS Word content), followed by a view of the underlying HTML.]
You can tweak the allowed tags and attributes so that they suit your needs, or extend the script further.
Happy scripting...