How to ban words with diacritics using a blacklist array and regex?

Question

I have an input of type text where I return true or false depending on a list of banned words. Everything works fine. My problem is that I don't know how to check against words with diacritics from the array:

var bannedWords = ["bad", "mad", "testing", "băţ"];
var regex = new RegExp('\\b' + bannedWords.join("\\b|\\b") + '\\b', 'i');

$(function () {
  $("input").on("change", function () {
    var valid = !regex.test(this.value);
    alert(valid);
  });
});

<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<input type='text' name='word_to_check'>

For example, on the word băţ it returns true instead of false.

Possible duplicate of utf-8 word boundary regex in javascript
– Chiu, Commented Aug 25, 2016 at 8:52
That link does not help me, or at least I don't understand how it helps. Can you explain why you think my question is a duplicate of that one?
– Ionut Necula, Commented Aug 25, 2016 at 12:05
Instead of using the word boundary \b, try what the linked answer suggests. ă and ţ are not ASCII characters; that's why \b fails. This is where the UTF-8 word boundary steps in.
– Chiu, Commented Aug 25, 2016 at 12:54
Simply put, diacritics mean UTF-8. That's why I flagged your question as a duplicate. Hope it helps.
– Chiu, Commented Aug 25, 2016 at 13:57
I'm not sure what the problem is. If you have a list of banned words, put them into a single regex with alternations, then check that. Why go through all this hassle? If you have a large list, make a regex trie out of a ternary tree. Grab this app (screenshot) to make it for you. And you shouldn't be using a word boundary anyway; you should use a whitespace boundary: (?<!\S)(?:stuff|or|stuff)(?!\S)
– user557597, Commented Aug 31, 2016 at 18:09

myf · Accepted Answer · 2016-08-30 09:25:02Z

5

+25

Chiu's comment is right: 'aaáaa'.match(/\b.+?\b/g) yields the quite counter-intuitive [ "aa", "á", "aa" ], because a "word character" (\w) in JavaScript regular expressions is just shorthand for [A-Za-z0-9_] ('case-insensitive-alpha-numeric-and-underscore'), so a word boundary (\b) matches any place between a chunk of alphanumerics and any other character. This makes extracting "Unicode words" quite hard.
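
The effect is easy to reproduce in a console. A minimal sketch (the variable names are only for illustration):

```javascript
// \w is ASCII-only, so \b sees a boundary on both sides of "á".
var pieces = 'aaáaa'.match(/\b.+?\b/g);
console.log(pieces); // [ "aa", "á", "aa" ]

// The question's pattern therefore cannot match a word that ends in a
// non-ASCII letter: there is no \b between "ţ" and the end of the string.
var asciiBounded = /\bbăţ\b/i;
console.log(asciiBounded.test('băţ')); // false
console.log(/\bbad\b/i.test('bad'));   // true
```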

For non-unicase writing systems it is possible to identify a "word character" by its dual nature: ch.toUpperCase() != ch.toLowerCase(), so your altered snippet could look like this:

var bannedWords = ["bad", "mad", "testing", "băţ", "bať"];
var bannedWordsRegex = new RegExp('-' + bannedWords.join("-|-") + '-', 'i');

$(function() {
  $("input").on("input", function() {
    var invalid = bannedWordsRegex.test(dashPaddedWords(this.value));
    $('#log').html(invalid ? 'bad' : 'good');
  });
  $("input").trigger("input").focus();

  function dashPaddedWords(str) {
    return '-' + str.replace(/./g, wordCharOrDash) + '-';
  };

  function wordCharOrDash(ch) {
    return isWordChar(ch) ? ch : '-'
  };

  function isWordChar(ch) {
    return ch.toUpperCase() != ch.toLowerCase();
  };
});

<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<input type='text' name='word_to_check' value="ba">
<p id="log"></p>
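
For trying the idea outside the browser, the same logic can be sketched without jQuery or the DOM. The wrapper name containsBannedWord is hypothetical and not part of the answer; the sample inputs are only illustrations. Note that digits have no case, so this approach treats them as separators.

```javascript
var bannedWords = ["bad", "mad", "testing", "băţ", "bať"];
var bannedWordsRegex = new RegExp('-' + bannedWords.join("-|-") + '-', 'i');

// A character counts as a "word character" when it has distinct cases.
function isWordChar(ch) {
  return ch.toUpperCase() != ch.toLowerCase();
}

// Replace every caseless character (spaces, digits, punctuation) with a dash
// and pad both ends, so each word ends up delimited by dashes.
function dashPaddedWords(str) {
  return '-' + str.replace(/./g, function (ch) {
    return isWordChar(ch) ? ch : '-';
  }) + '-';
}

// Hypothetical wrapper, named here only for the example.
function containsBannedWord(str) {
  return bannedWordsRegex.test(dashPaddedWords(str));
}

console.log(dashPaddedWords('a băţ!'));    // "-a-băţ--"
console.log(containsBannedWord('a băţ!')); // true
console.log(containsBannedWord('băţy'));   // false
console.log(containsBannedWord('bad1'));   // true: the digit acts as a separator
```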

edited Aug 30, 2016 at 9:25

answered Aug 29, 2016 at 9:58

myf


3 Comments

Adam Katz Over a year ago

Your "just shorthand for" link displays backslashes but should not. I assume this is a markdown error, but I'll let you correct it yourself.

Adam Katz Over a year ago

[\b] and [^\b] don't work that way. Because these are inside character classes, they are interpreted as "is a backspace character" and "is not a backspace character" (read more here). The opposite of \b (zero-width word boundary) is \B (zero-width non-word boundary), which (in JavaScript) uses the same [A-Za-z_0-9] definition of "word characters" and is therefore unhelpful here.

myf Over a year ago

Thanks for remarks: corrected link formatting. And thanks for that /^[\b]$/.test('\u0008') === true quip, I admit I didn't know that. But it was not that relevant, for I just wanted to demonstrate that "\b works just with ASCII" thing like you did in your answer.

Adam Katz · Accepted Answer · 2016-09-01 16:04:51Z

Let's see what's going on:

alert("băţ".match(/\w\b/));

This is [ "b" ] because word boundary \b doesn't recognize word characters beyond ASCII. JavaScript's "word characters" are strictly [0-9A-Z_a-z], so aä, pπ, and zƶ match \w\b\W since they contain a word character, a word boundary, and a non-word character.

I think the best you can do is something like this:

var bound = '[^\\w\u00c0-\u02c1\u037f-\u0587\u1e00-\u1ffe]';
var regex = new RegExp('(?:^|' + bound + ')(?:'
                       + bannedWords.join('|')
                       + ')(?=' + bound + '|$)', 'i');

where bound is a negated character class covering all ASCII word characters plus most Latin-esque letters, used with start/end of line markers to approximate an internationalized \b. (The trailing boundary is a zero-width lookahead, which better mimics \b and therefore works well with the g regex flag.)

Given ["bad", "mad", "testing", "băţ"], this becomes:

/(?:^|[^\w\u00c0-\u02c1\u037f-\u0587\u1e00-\u1ffe])(?:bad|mad|testing|băţ)(?=[^\w\u00c0-\u02c1\u037f-\u0587\u1e00-\u1ffe]|$)/i

This doesn't need anything like ….join('\\b|\\b')… because there are parentheses around the list (and that would create things like \b(?:hey\b|\byou)\b, which is akin to \bhey\b\b|\b\byou\b, including the nonsensical \b\b – which JavaScript interprets as merely \b).

You can also use var bound = '[\\s!-/:-@[-`{-~]' for a simpler ASCII-only list of acceptable non-word characters. Be careful about that order! The dashes indicate ranges between characters.
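
To compare the two boundary definitions side by side, here is a small sketch. The helper buildRegex and the variable names latinRegex and asciiRegex are hypothetical; the patterns themselves are the ones given above.

```javascript
var bannedWords = ["bad", "mad", "testing", "băţ"];

// Hypothetical helper: wraps the banned-word list in the boundary emulation
// described above, for whichever "non-word" character class is passed in.
function buildRegex(bound) {
  return new RegExp('(?:^|' + bound + ')(?:'
                    + bannedWords.join('|')
                    + ')(?=' + bound + '|$)', 'i');
}

// Anything that is neither \w nor in the listed Latin-esque ranges.
var latinRegex = buildRegex('[^\\w\u00c0-\u02c1\u037f-\u0587\u1e00-\u1ffe]');
// Whitespace plus ASCII punctuation only (the [-` range includes "_").
var asciiRegex = buildRegex('[\\s!-/:-@[-`{-~]');

console.log(latinRegex.test('a băţ!')); // true
console.log(latinRegex.test('băţy'));   // false
console.log(latinRegex.test('_bad_'));  // false: "_" is part of \w
console.log(asciiRegex.test('_bad_'));  // true: "_" counts as punctuation here
console.log(asciiRegex.test('băţ'));    // true
```

The two variants disagree on underscores, which is worth deciding on deliberately before picking one.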

Wiktor Stribiżew · Accepted Answer · 2016-09-01 20:24:42Z

You need a Unicode-aware word boundary. The easiest way is to use the XRegExp package.

Although its \b is still ASCII-based, there is a \p{L} construct (or the shorter \pL) that matches any Unicode letter from the BMP. Building a custom word boundary with this construct is easy:

\b                     word            \b
  ---------------------------------------
 |                       |               |
([^\pL0-9_]|^)         word       (?=[^\pL0-9_]|$)

The leading word boundary can be represented with a (non-)capturing group ([^\pL0-9_]|^) that matches (and consumes) either a character other than a BMP Unicode letter, a digit or _, or the start of the string before the word.

The trailing word boundary can be represented with a positive lookahead (?=[^\pL0-9_]|$) that requires either a character other than a BMP Unicode letter, a digit or _, or the end of the string after the word.

See the snippet below that will detect băţ as a banned word, and băţy as an allowed word.

var bannedWords = ["bad", "mad", "testing", "băţ"];
var regex = new XRegExp('(?:^|[^\\pL0-9_])(?:' + bannedWords.join("|") + ')(?=$|[^\\pL0-9_])', 'i');

$(function () {
  $("input").on("change", function () {
    var valid = !regex.test(this.value);
    //alert(valid);
    console.log("The word is", valid ? "allowed" : "banned");
  });
});

<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/xregexp/3.1.1/xregexp-all.min.js"></script>
<input type='text' name='word_to_check'>
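
In engines that support Unicode property escapes (\p{L} with the u flag, standardized in ES2018), a boundary of the same shape can be sketched with the built-in RegExp. This is an alternative to the XRegExp call above rather than part of the answer, and the native \p{L} is not restricted to the BMP.

```javascript
var bannedWords = ["bad", "mad", "testing", "băţ"];

// Same shape as the XRegExp pattern, but using the native \p{L} escape,
// which is only recognized when the "u" flag is set.
var regex = new RegExp('(?:^|[^\\p{L}0-9_])(?:' + bannedWords.join("|") +
                       ')(?=$|[^\\p{L}0-9_])', 'iu');

console.log(regex.test('băţ'));      // true  (banned)
console.log(regex.test('băţy'));     // false (allowed)
console.log(regex.test('not băţ.')); // true  (banned)
```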

SamWhan · Accepted Answer · 2016-09-02 10:04:44Z

2

Instead of using a word boundary, you could do it with

(?:[^\w\u0080-\u02af]+|^)

to check for start of word, and

(?=[^\w\u0080-\u02af]|$)

to check for the end of it.

The [^\w\u0080-\u02af] matches any character that is not (^) a basic Latin word character (\w) or part of the Unicode Latin-1 Supplement, Latin Extended-A, Latin Extended-B and IPA Extensions blocks. This includes some punctuation, but the pattern would get very long if it matched just letters. It may also have to be extended if other character sets have to be included. See for example Wikipedia.

Since JavaScript doesn't support look-behinds (as of this writing), the start-of-word test consumes any of the aforementioned non-word characters, but I don't think that should be a problem. The important thing is that the end-of-word test doesn't.

Also, putting these tests outside a non-capturing group that alternates the words makes it significantly more efficient.

var bannedWords = ["bad", "mad", "testing", "băţ", "båt", "süß"],
    regex = new RegExp('(?:[^\\w\\u00c0-\\u02af]+|^)(?:' + bannedWords.join("|") + ')(?=[^\\w\\u00c0-\\u02af]|$)', 'i');

function myFunction() {
    document.getElementById('result').innerHTML = 'Banned = ' + regex.test(document.getElementById('word_to_check').value);
}

<!DOCTYPE html>
<html>
<body>

Enter word: <input type='text' id='word_to_check'>
<button onclick='myFunction()'>Test</button>

<p id='result'></p>

</body>
</html>
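
Lookbehind assertions were added to the language in ES2018, so in engines that support them the start-of-word test no longer has to consume anything. A sketch under that assumption, reusing the character ranges from the snippet above (the wordChar variable is introduced here only for readability):

```javascript
var bannedWords = ["bad", "mad", "testing", "băţ", "båt", "süß"];
var wordChar = '[\\w\\u00c0-\\u02af]';

// (?<!...) and (?!...) are both zero-width, so each match is exactly the word.
var regex = new RegExp('(?<!' + wordChar + ')(?:' + bannedWords.join("|") +
                       ')(?!' + wordChar + ')', 'gi');

console.log('bad, mad süß'.match(regex)); // [ "bad", "mad", "süß" ]
console.log('băţăţ'.match(regex));        // null
```

Because nothing around the word is consumed, this form is convenient when the matches are to be listed or replaced rather than just detected.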

edited Sep 2, 2016 at 10:04

answered Sep 1, 2016 at 13:01

SamWhan


5 Comments

myf Over a year ago

You forgot to escape backslashes in string literals. Also this will let pass values like bad!!1! which I assume should be blocked.

SamWhan Over a year ago

Thanks @myf for pointing that out. I believe it's fixed now :)

myf Over a year ago

better :] although now it bans values such as băţăţ, which I assume should be permitted.

SamWhan Over a year ago

@myf Think that's fixed now as well. :S

myf Over a year ago

Yup, even better. Just look after those _bad_ underscores that leaked along with \w :]

TolMera · Accepted Answer · 2016-09-01 09:36:14Z

When dealing with characters outside my base set (which can show up at any time), I convert them to an appropriate base equivalent (8bit, 16bit, 32bit) before running any character matching over them.

var bannedWords = ["bad", "mad", "testing", "băţ"];

// Encode a string as dash-delimited hex code units, e.g. "bad" -> "-62-61-64-".
// The leading dash keeps a longer code such as "162" from matching a banned "62".
function toHexCodes(str) {
  var bits = "-";
  for (var i = 0; i < str.length; i++) {
    bits += str.charCodeAt(i).toString(16) + "-";
  }
  return bits;
}

// Hex digits and dashes are not regex metacharacters, so no escaping is needed.
var regex = new RegExp(bannedWords.map(toHexCodes).join("|"), 'i');

// Returns true when the input contains no banned word.
function checkword(word) {
  return !regex.test(toHexCodes(word));
}

The separator "-" is there to make sure that unique characters don't bleed together creating undesired matches.

This is very useful, as it brings all the characters down to a common base that everything can interact with. It can also be re-encoded back to its original form without having to ship it in a key/value pair.

For me the best thing about it is that I don't have to know all of the rules for all of the character sets that I might intersect with, because I can pull them all into a common playing field.

As a side note:

To speed things up, rather than passing the large regex statement that you probably have, which takes exponentially longer to pass with the length of the words that you're banning, I would pass each separate word in the sentence through the filter and break the filter up into length-based segments, like:

checkword3Chars();
checkword4Chars();
checkword5chars();

whose functions you can generate systematically and even create on the fly as and when they become required.
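
One possible reading of this side note is sketched below: split the input into words, then look each word up only among banned words of the same length. The names bannedByLength and containsBannedWord, the use of Set lookups in place of generated checkwordNChars functions, and the tokenizing regex (which needs an engine with \p{L} support) are all assumptions made for illustration.

```javascript
var bannedWords = ["bad", "mad", "testing", "băţ"];

// Group banned words by length, e.g. 3 -> {bad, mad, băţ} and 7 -> {testing}.
var bannedByLength = new Map();
bannedWords.forEach(function (word) {
  var key = word.length;
  if (!bannedByLength.has(key)) {
    bannedByLength.set(key, new Set());
  }
  bannedByLength.get(key).add(word.toLowerCase());
});

// Split on anything that is not a letter, digit or underscore, then check
// each token only against the bucket for its own length.
function containsBannedWord(sentence) {
  return sentence.split(/[^\p{L}\p{N}_]+/u).some(function (token) {
    var bucket = bannedByLength.get(token.length);
    return bucket !== undefined && bucket.has(token.toLowerCase());
  });
}

console.log(containsBannedWord('this is Testing')); // true
console.log(containsBannedWord('a băţ!'));          // true
console.log(containsBannedWord('băţy badly'));      // false
```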
