I implemented a version that seems to work quite well, although I still use a (rather general and shoddy) regex to extract the HTML tags from the text. Here it is in commented JavaScript:
Method
/**
 * Manipulate text inside HTML according to passed function
 * @param html the html string to manipulate
 * @param manipulator the function to manipulate with (will be passed single word)
 * @returns manipulated string including unmodified HTML
 *
 * Currently limited in that manipulator operates on words determined by regex
 * word boundaries, and must return same length manipulated word
 *
 */
var manipulate = function(html, manipulator) {
    var words, i,
        final = '', // used to prepare return value
        tags = [], // used to store tags as they are stripped from the html string
        x = 0; // used to track the number of characters the html string is reduced by during stripping

    // remove tags from html string, and use callback to store them with their index
    // then split by word boundaries to get plain words from original html
    words = html.replace(/<.+?>/g, function(match, index) {
        tags.unshift({
            match: match,
            index: index - x
        });
        x += match.length;
        return '';
    }).split(/\b/);

    // loop through each segment and build the final string,
    // manipulating segments made of word characters and appending the rest as-is
    // (tested with \w rather than by even/odd position, because the text
    // may start with a non-word character)
    for (i = 0; i < words.length; i++) {
        final += /^\w/.test(words[i]) ? manipulator(words[i]) : words[i];
    }

    // loop through each stored tag, and insert into final string
    for (i = 0; i < tags.length; i++) {
        final = final.slice(0, tags[i].index) + tags[i].match + final.slice(tags[i].index);
    }

    // ready to go!
    return final;
};
The function defined above accepts a string of HTML and a manipulation function, and applies that function to the words within the string regardless of whether they are split by HTML elements.
It works by first removing all HTML tags and storing each tag along with the index it was taken from, then manipulating the text, then inserting the tags back into their original positions in reverse order.
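To make the index bookkeeping concrete, here is a minimal trace of the strip-and-reinsert steps on a tiny input. The helper names `stripTags` and `restoreTags` are hypothetical; they only split the function above into its two halves for illustration, and they use the same rough tag regex.

```javascript
// Hypothetical helpers (not part of the code above) that isolate the two
// halves of the technique: strip tags while recording indexes, then reinsert.
var stripTags = function(html) {
    var tags = [], removed = 0;
    var text = html.replace(/<.+?>/g, function(match, index) {
        // index is the offset in the original html; subtracting the characters
        // already removed gives the offset in the stripped text
        tags.unshift({ match: match, index: index - removed });
        removed += match.length;
        return '';
    });
    return { text: text, tags: tags };
};

var restoreTags = function(text, tags) {
    // tags are stored last-first, so each insert only shifts text after it and
    // the tags still to be inserted (all at or before this index) stay valid
    for (var i = 0; i < tags.length; i++) {
        text = text.slice(0, tags[i].index) + tags[i].match + text.slice(tags[i].index);
    }
    return text;
};

var stripped = stripTags('Quick<em>Draw</em>McGraw');
console.log(stripped.text); // QuickDrawMcGraw
console.log(stripped.tags); // </em> at index 9, then <em> at index 5

// upper-casing keeps the length, so the recorded indexes still line up
var restored = restoreTags(stripped.text.toUpperCase(), stripped.tags);
console.log(restored); // QUICK<em>DRAW</em>MCGRAW
```

Because the indexes refer to the stripped text, the reinsertion step only works as long as the manipulated text has the same length as the stripped text.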
Test
/**
 * Test our function with various input
 */
var reverse, rutherford, shuffle, text, titleCase;

// set our test html string
text = "<h2>Header</h2><p>all the <span class=\"bright\">content</span> here</p>\nQuick<em>Draw</em>McGraw\n<em>going</em><i>home</i>";

// function used to reverse words
reverse = function(s) {
    return s.split('').reverse().join('');
};

// function used by rutherford to return a shuffled array
// (quick and dirty: a random comparator does not give a uniform shuffle)
shuffle = function(a) {
    return a.sort(function() {
        return Math.round(Math.random()) - 0.5;
    });
};

// function used to shuffle the middle of words, leaving each end undisturbed
rutherford = function(inc) {
    var m = inc.match(/^(.?)(.*?)(.)$/);
    return m[1] + shuffle(m[2].split('')).join('') + m[3];
};

// function to make word Title Cased
titleCase = function(s) {
    return s.replace(/./, function(w) {
        return w.toUpperCase();
    });
};

console.log(manipulate(text, reverse));
console.log(manipulate(text, rutherford));
console.log(manipulate(text, titleCase));
There are still a few quirks, like the heading and paragraph text not being recognized as separate words (because they sit in separate block-level tags rather than inline tags, and nothing separates them once the tags are stripped), but this is basically a proof of concept of what I was trying to do.
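The block-level quirk is easy to reproduce in isolation. This small sketch uses the same rough tag regex on a shortened version of the test string; the variable names `merged` and `parts` are just for illustration.

```javascript
// Strip the tags the same way manipulate() does
var merged = '<h2>Header</h2><p>all the content</p>'.replace(/<.+?>/g, '');
console.log(merged); // Headerall the content

// With the tags gone there is no boundary between "Header" and "all",
// so they are handed to the manipulator as a single word
var parts = merged.split(/\b/);
console.log(parts); // [ 'Headerall', ' ', 'the', ' ', 'content' ]
```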
I would also like it to handle a manipulation function that actually adds or removes text, rather than only replacing or moving it (so the string length could vary after manipulation), but that opens up a whole new can of worms I am not yet ready for.
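To show why variable length is a problem, here is a small sketch with a condensed copy of `manipulate` (words are detected with a `\w` test here) and a hypothetical manipulator that makes every word one character longer. The tags are still inserted at the indexes recorded before manipulation, so they drift into the wrong places.

```javascript
// Condensed copy of manipulate() from above, for a self-contained demo
var manipulate = function(html, manipulator) {
    var tags = [], removed = 0, result = '', i;
    var words = html.replace(/<.+?>/g, function(match, index) {
        tags.unshift({ match: match, index: index - removed });
        removed += match.length;
        return '';
    }).split(/\b/);
    for (i = 0; i < words.length; i++) {
        result += /^\w/.test(words[i]) ? manipulator(words[i]) : words[i];
    }
    for (i = 0; i < tags.length; i++) {
        result = result.slice(0, tags[i].index) + tags[i].match + result.slice(tags[i].index);
    }
    return result;
};

var html = 'all the <b>content</b> here';

// same-length manipulator: tags land where they started
var sameLength = manipulate(html, function(w) { return w.toUpperCase(); });
console.log(sameLength); // ALL THE <b>CONTENT</b> HERE

// hypothetical length-changing manipulator: every word grows by one character,
// but the tags are still inserted at their old indexes
var longer = manipulate(html, function(w) { return w + '!'; });
console.log(longer); // all! the<b>! conte</b>nt! here!
```

One possible direction, offered only as an assumption, would be to record each tag's position relative to the word it falls in rather than as an absolute index, though a tag in the middle of a word that changes length has no obviously correct new position.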
Now that I have added some comments to the code and put it up as a gist, I hope that someone will improve it, especially if they can replace the regex part with something better!
(outputs to console)
And now finally using an HTML parser
(http://ejohn.org/files/htmlparser.js)
Demo: http://jsfiddle.net/EDJyU/