Javascript RegExp + Word boundaries + unicode characters

Question

I am building a search feature and I am going to use JavaScript autocomplete with it. I am from Finland (Finnish language), so I have to deal with some special characters like ä, ö and å.

When the user types text into the search input field, I try to match the text to the data.

Here is a simple example that does not work correctly if the user types, for example, "ää". The same happens with "äl".

var title = "this is simple string with finnish word tämä on ääkköstesti älkää ihmetelkö";
// Does not work
var searchterm = "äl";

// does not work
//var searchterm = "ää";

// Works
//var searchterm = "wi";

if ( new RegExp("\\b"+searchterm, "gi").test(title) ) {
    $("#result").html("Match: ("+searchterm+"): "+title);
} else {
    $("#result").html("nothing found with term: "+searchterm);   
}

http://jsfiddle.net/7TsxB/

So how can I get those ä, ö and å characters to work with a JavaScript regex?

I think I should use Unicode escapes, but how should I do that? The codes for those characters are:

[\u00C4,\u00E4,\u00C5,\u00E5,\u00D6,\u00F6]
=> ÄäÅåÖö

@Walkerneo: \b means "word boundary" in a regex; the backslash is escaped here because it's in a string.
– apsillers, Commented May 14, 2012 at 20:05
I use the \b because I want to match at the beginning of each word.
– user1394520, Commented May 14, 2012 at 20:16
As you see, Javascript is stuck in the idiotic 1960’s-style ASCII-only mentality. It does not meet even the most basic conformance requirements needed for Level 1’s “Basic Unicode Support” per UTS#18 on Unicode Regular Expressions. Trying to do real Unicode text-processing work in Javascript is an awful joke, and a cruel one, too: it cannot be done. The XRegExp plugin mentioned below is necessary but not sufficient for these purposes.
– tchrist, Commented May 16, 2012 at 16:27
Newcomers beware: This cannot be done in regexp. Not with \b, not with \s, not with XRegExp, not with lookaheads or lookarounds. Believe me, I've tried it all, and everything broke in one way or another. The only reliable way I've found that works so far is encoding the Unicode string back to ASCII and performing an ASCII-only regexp search/replace with \b as originally intended. See here: stackoverflow.com/a/10590188/1329367
– Mahn, Commented Oct 12, 2015 at 19:57

mowwwalker · Accepted Answer · 2014-08-05 08:38:54Z

50

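The problem is that the word boundary \b is defined in terms of \w, which in JavaScript only covers the ASCII characters [A-Za-z0-9_]. A character such as ä is therefore treated as a non-word character, so \b does not match in front of a word that starts with it.

Instead of using \b, try using (?:^|\\s):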

var title = "this is simple string with finnish word tämä on ääkköstesti älkää ihmetelkö";
// Now works
var searchterm = "äl";

// Also works
//var searchterm = "ää";

// Still works
//var searchterm = "wi";

if ( new RegExp("(?:^|\\s)"+searchterm, "gi").test(title) ) {
    $("#result").html("Match: ("+searchterm+"): "+title);
} else {
    $("#result").html("nothing found with term: "+searchterm);   
}

Breakdown:

(?: parentheses () form a capturing group in a regex. Parentheses that start with a question mark and colon (?: form a non-capturing group; they just group the terms together

^ the caret symbol matches the beginning of a string

| the bar is the "or" operator.

\s matches whitespace (appears as \\s in the string because we have to escape the backslash)

) closes the group

So instead of using \b, which matches word boundaries and doesn't work for unicode characters, we use a non-capturing group which matches the beginning of a string OR whitespace.
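One thing to keep in mind: unlike \b, the group (?:^|\s) consumes the whitespace it matches, so the full match can start with a space. A minimal sketch of one way around that, assuming you want the matched term itself, is to wrap the search term in a capturing group. The helper name findWordStart is made up for illustration, and the search term is inserted unescaped, as in the code above.

```javascript
// Hypothetical helper: returns the text matched at the start of a word, or null.
function findWordStart(searchterm, text) {
  // Note: searchterm is inserted as-is, so regex metacharacters are not escaped.
  const re = new RegExp("(?:^|\\s)(" + searchterm + ")", "i");
  const m = re.exec(text);
  // m[0] may include the leading whitespace; m[1] is just the term.
  return m ? m[1] : null;
}

const title = "this is simple string with finnish word tämä on ääkköstesti älkää ihmetelkö";

console.log(findWordStart("äl", title)); // "äl"
console.log(findWordStart("mä", title)); // null ("mä" only occurs inside "tämä")
```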

edited Aug 5, 2014 at 8:38

answered May 14, 2012 at 20:25

mowwwalker

17.5k30 gold badges109 silver badges166 bronze badges


7 Comments

FirstVertex Over a year ago

"try this" isn't a solution. Give some information about why the suggested regex works. What does (?:^|\\s) really do? You don't explain this solution at all.

Lea Verou Over a year ago

This is NOT a correct solution. (?:^|\\s) is not a zero-width assertion like \b is, and will consume characters from the match. A positive lookahead would be a better idea ((?=^|\\s)) but would only work after the match, as lookbehind is still not supported. Also, word boundaries are not just spaces and string boundaries, but a ton of other characters.

Ron Inbar Over a year ago

Is there any reason not to include $ (end of string) in the regex? I.e. (?:^|\s|$)

user3871 Over a year ago

This also matches partial string matches. '¿dónde está la alcaldesa?': es and está are matched, which is bad. Only está should be matched. \\b is supposed to be helpful with full-word boundaries.

Jonas Sourlier Over a year ago

@LeaVerou do you have a better solution?


Noah Freitas · Accepted Answer · 2012-05-14 20:33:21Z

23

The \b character class in JavaScript RegEx is really only useful with simple ASCII encoding. \b is a shortcut code for the boundary between \w and \W sets or \w and the beginning or end of the string. These character sets only take into account ASCII "word" characters, where \w is equal to [a-zA-Z0-9_] and \W is the negation of that class.

This makes the RegEx character classes largely useless for dealing with any real language.

\s should work for what you want to do, provided that search terms are only delimited by whitespace.
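The ASCII-only behaviour described above is easy to check in a console. This is just a small demonstration, not a fix; the variable names are made up, and the last line assumes an engine with ES2018 Unicode property escapes.

```javascript
// \w only covers [A-Za-z0-9_], so "ä" counts as a non-word character.
const asciiLetterIsWord = /\w/.test("a");      // true
const umlautIsWord = /\w/.test("ä");           // false

// As a result, \b "sees" a boundary between "p" and "ä" inside a word.
const boundaryInsideWord = /\bä/.test("päp");  // true

// With the u flag, \p{L} matches any Unicode letter, including "ä".
const umlautIsLetter = /\p{L}/u.test("ä");     // true

console.log({ asciiLetterIsWord, umlautIsWord, boundaryInsideWord, umlautIsLetter });
```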

answered May 14, 2012 at 20:33

Noah Freitas

17.5k11 gold badges53 silver badges67 bronze badges

1 Comment

Alan Moore Over a year ago

+1, but \b is not a character class shorthand like \w and \s, it's a zero-width assertion like \A, $, and lookarounds.

Flimm · Accepted Answer · 2019-12-10 13:25:03Z

15

This question is old, but I think I found a better solution for word boundaries in regular expressions with Unicode letters. Using the XRegExp library, you can implement a valid \b boundary by expanding this:

XRegExp('(?=^|$|[^\\p{L}])')

The resulting expression is 4000+ characters long, but it seems to perform quite well.

Some explanation: (?= ) is a zero-length lookahead that looks for the beginning or end of the string or a non-letter Unicode character. The most important thing is the lookahead, because \b doesn't capture anything: it is simply true or false.

edited Dec 10, 2019 at 13:25

Flimm

154k49 gold badges282 silver badges295 bronze badges

answered Sep 13, 2015 at 21:44

max masetti

1511 silver badge3 bronze badges

1 Comment

Patrick Janser Over a year ago

Effectively, I'm still surprised to see that in 2024 JS's regex engine still doesn't convert \b to [\p{L}\p{N}_] with the u or v flags, compared to most of all the other regex engines. But now, you don't need any more the XRegExp library as both the u and v flags in vanilla JS let us use the Unicode properties. You can also replace [^\p{L}] by \P{L}.

andrefs · Accepted Answer · 2019-07-31 12:30:06Z

13

\b is a shortcut for the transition between a letter and a non-letter character, or vice-versa.

Updating and improving on max masetti's answer:

With the introduction of Unicode property escapes in ES2018 (available when the /u flag is set), you can now use \p{L} to represent any Unicode letter, and \P{L} (notice the uppercase P) to represent anything but.

EDIT: Previous version was incomplete.

As such:

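const text = 'A Fé, o Império, e as terras viciosas';

// The u flag is required; without it \p{L} is not treated as a property escape.
text.split(/(?<=\p{L})(?=\P{L})|(?<=\P{L})(?=\p{L})/u);

// ['A', ' ', 'Fé', ', ', 'o', ' ', 'Império', ', ', 'e', ' ', 'as', ' ', 'terras', ' ', 'viciosas']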

We're using a lookbehind (?<=...) to find a letter and a lookahead (?=...) to find a non-letter, or vice versa.
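Applied to the search from the question, a minimal sketch might look like the following. The helper name startsWord is made up for illustration; it assumes the term contains no regex metacharacters and that the engine supports lookbehind and the u flag.

```javascript
const title = "this is simple string with finnish word tämä on ääkköstesti älkää ihmetelkö";

// Hypothetical helper: true if `term` occurs at the start of a word in `text`.
// "Start of a word" here means "not directly preceded by a Unicode letter".
function startsWord(term, text) {
  return new RegExp(`(?<!\\p{L})${term}`, "iu").test(text);
}

console.log(startsWord("äl", title)); // true
console.log(startsWord("ää", title)); // true
console.log(startsWord("mä", title)); // false ("mä" only occurs inside "tämä")
```

A negative lookbehind is used instead of (?<=\P{L}) so that a term at the very start of the string also matches.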

edited Jul 31, 2019 at 12:30

answered Jul 31, 2019 at 12:15

andrefs

6035 silver badges16 bronze badges

4 Comments

loretoparisi Over a year ago

Pretty cool, I was using (?<!\\S)$1(?!\\S) for unicode word match.

loretoparisi Over a year ago

I actually have tried (?<=^|\P{L})xxx(?=\P{L}|$) but it does not work properly actually, at least in JavaScript.

Lucas Werkmeister Over a year ago

Note that lookbehind actually has worse browser support than the /u modifier – “everyone” except IE has /u, but Safari and related browsers don’t have lookbehind yet.

Anthony Heaney Over a year ago

This was the answer I was looking for to deal with title casing a name that could use accents. name.toLowerCase().replace(/(\P{L}\p{L})|(^\p{L})/gu, function(a){ return a.toUpperCase()});

ypid · Accepted Answer · 2014-06-29 19:01:41Z

7

I would recommend using XRegExp when you have to work with a specific set of Unicode characters. The author of this library mapped all kinds of regional character sets, which makes working with different languages easier.

edited Jun 29, 2014 at 19:01

ypid

1,8881 gold badge15 silver badges11 bronze badges

answered May 14, 2012 at 21:23

micnic

11.3k5 gold badges46 silver badges56 bronze badges

Comments

Mariia Abramyk · Accepted Answer · 2020-08-13 14:24:47Z

4

Despite the fact that the issue seems to be 8 years old, I ran into a similar problem (I had to match Cyrillic letters) not so long ago. I spent a whole day on this and could not find any appropriate answer here on StackOverflow. So, to save others the effort, I'd like to share my solution.

Yes, \b word boundary works only with Latin letters (Word boundary: \b):

Word boundary \b doesn’t work for non-Latin alphabets. The word boundary test \b checks that there should be \w on the one side from the position and "not \w" – on the other side. But \w means a Latin letter a-z (or a digit or an underscore), so the test doesn’t work for other characters, e.g. Cyrillic letters or hieroglyphs.

Yes, the JavaScript RegExp implementation has poor Unicode support by default: \w and \b only consider ASCII word characters.

So, I tried implementing my own word boundary feature with support for non-Latin characters. To make the word boundary work just with Cyrillic characters, I created this regular expression:

new RegExp(`(?<![\u0400-\u04ff])${cyrillicSearchValue}(?![\u0400-\u04ff])`,'gi')

Where \u0400-\u04ff is a range of Cyrillic characters provided in the table of codes. It is not an ideal solution, however, it works properly in most cases.

To make it work in your case, you just have to pick up an appropriate range of codes from the list of Unicode characters.

To try out my example run the code snippet below.

function getMatchExpression(cyrillicSearchValue) {
  return new RegExp(
    `(?<![\u0400-\u04ff])${cyrillicSearchValue}(?![\u0400-\u04ff])`,
    'gi',
  );
}

const sentence = 'Будь-який текст кирилицею, де необхідно знайти слово з контексту';

console.log(sentence.match(getMatchExpression('текст')));
// expected output: ["текст"]


console.log(sentence.match(getMatchExpression('но')));
// expected output: null
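As a variation on the snippet above (not part of the original solution), engines that support the u flag with Unicode property escapes let you name the script instead of hard-coding a code-point range. A minimal sketch, assuming ES2018 support and a search value without regex metacharacters; the function name getMatchExpressionByScript is made up:

```javascript
// Hypothetical variant: use the Cyrillic script property instead of \u0400-\u04ff.
function getMatchExpressionByScript(searchValue) {
  return new RegExp(
    `(?<!\\p{Script=Cyrillic})${searchValue}(?!\\p{Script=Cyrillic})`,
    'giu',
  );
}

const sample = 'Будь-який текст кирилицею, де необхідно знайти слово з контексту';

console.log(sample.match(getMatchExpressionByScript('текст')));
// expected output: ["текст"]

console.log(sample.match(getMatchExpressionByScript('но')));
// expected output: null
```

For another script, swap the property value, for example \p{Script=Greek}.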

edited Aug 13, 2020 at 14:24

answered Aug 13, 2020 at 8:51

Mariia Abramyk

1762 silver badges7 bronze badges

3 Comments

loretoparisi Over a year ago

It will work simply using (?<!\\S)${cyrillicSearchValue}(?!\\S)

Даниил Пронин Over a year ago

It does not work in WebKit (Safari). SyntaxError: Invalid regular expression group specifier name

Michael T Over a year ago

@loretoparisi Your regex won't always work. For example if the string was '.текст', \S will not detect the start of the word correctly.

apsillers · Accepted Answer · 2012-05-14 20:30:30Z

2

I noticed something really weird with \b when using Unicode:

/\bo/.test("pop"); // false (obviously)
/\bä/.test("päp"); // true (what..?)

/\Bo/.test("pop"); // true
/\Bä/.test("päp"); // false (what..?)

It appears that the meanings of \b and \B are reversed, but only when used with non-ASCII Unicode characters? There might be something deeper going on here, but I'm not sure what it is.

In any case, it seems that the word boundary is the issue, not the Unicode characters themselves. Perhaps you should just replace \b with (^|[\s\\/_&-]), as that seems to work correctly. (Make your list of symbols more comprehensive than mine, though.)

edited May 14, 2012 at 20:30

answered May 14, 2012 at 20:18

apsillers

116k18 gold badges248 silver badges249 bronze badges

1 Comment

Tim Pietzcker Over a year ago

\b and \B aren't Unicode-aware in JavaScript, so they consider ä a non-alphanumeric character and therefore see a word boundary between p and ä.

Daniel Centore · Accepted Answer · 2021-12-06 04:28:55Z

I had a similar problem, where I was trying to replace all of a particular unicode word with a different unicode word, and I cannot use lookbehind because it's not supported in the JS engine this code will be used in. I ultimately resolved it like this:

const needle = "КАРТОПЛЯ";
const replace = "БАРАБОЛЯ";
const regex = new RegExp(
  String.raw`(^|[^\n\p{L}])`
    + needle
    + String.raw`(?=$|\P{L})`,
   "gimu",
);

const result = (
    'КАРТОПЛЯ сдффКАРТОПЛЯдадф КАРТОПЛЯ КАРТОПЛЯ КАРТОПЛЯ??? !!!КАРТОПЛЯ ;!;!КАРТОПЛЯ/#?#?'
    + '\n\nКАРТОПЛЯ КАРТОПЛЯ - - -КАРТОПЛЯ--'
  )
    .replace(regex, function (match, ...args) {
      return args[0] + replace;
    });
console.log(result)

output:

БАРАБОЛЯ сдффКАРТОПЛЯдадф БАРАБОЛЯ БАРАБОЛЯ БАРАБОЛЯ??? !!!БАРАБОЛЯ ;!;!БАРАБОЛЯ/#?#?

БАРАБОЛЯ БАРАБОЛЯ - - -БАРАБОЛЯ--

Breaking it apart

The first regex: (^|[^\n\p{L}])

^| = Start of the line or
[^\n\p{L}] = Any character which is not a letter or a newline

The second regex: (?=$|\P{L})

?= = Lookahead
$| = End of the line or
\P{L} = Any character which is not a letter

The first regex captures the group and is then used via args[0] to put it back into the string during replacement, thereby avoiding a lookbehind. The second regex utilized lookahead.

Note that the second one MUST be a lookahead because if we capture it then overlapping regex matches will not trigger (e.g. КАРТОПЛЯ КАРТОПЛЯ КАРТОПЛЯ would only match on the 1st and 3rd ones).

Works! But when a number is the first character of the string, the result is wrong.

Régis · Accepted Answer · 2022-08-12 14:29:05Z

2

Trying to find text "myTest":

/(?<![\p{L}\p{N}_])myTest(?![\p{L}\p{N}_])/gu

This is similar to the whole-word search in NetBeans or Notepad++. It finds the expression only where it is not directly preceded or followed by a letter, number, or underscore (the Unicode counterpart of the \w characters that the word boundary \b relies on).
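To reuse the pattern above for arbitrary terms, it can be wrapped in a small function. This is only a sketch: the name wholeWord is made up, and it assumes the term contains no regex metacharacters and that the engine supports lookbehind.

```javascript
// Hypothetical helper: builds the whole-word pattern for any term.
// Assumes `term` contains no regex metacharacters.
function wholeWord(term) {
  const wordChar = "[\\p{L}\\p{N}_]";
  return new RegExp(`(?<!${wordChar})${term}(?!${wordChar})`, "gu");
}

const sample = "myTest myTestä ämyTest my_myTest (myTest)";

console.log(sample.match(wholeWord("myTest")));
// ["myTest", "myTest"] (the first word and the one in parentheses)
```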

edited Aug 12, 2022 at 14:29

answered Aug 12, 2022 at 14:24

Régis

212 bronze badges

Comments

Heitor Chang · Accepted Answer · 2012-05-14 20:32:23Z

1

My idea is to search with codes representing the Finnish letters

new RegExp("\\b"+asciiOnly(searchterm), "gi").test(asciiOnly(title))

My original idea was to use plain encodeURI but the % sign seemed to interfere with the regexp.

http://jsfiddle.net/7TsxB/5/

I wrote a crude function using encodeURI to encode every character with code over 128 but removing its % and adding 'QQ' in the beginning. It is not the best marker but I couldn't get non alphanumeric to work.
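The function itself is only in the fiddle, so here is a rough reconstruction of the idea described above, not the original code. It assumes every non-ASCII character (code point above 127) is replaced by 'QQ' followed by its encodeURI form with the % signs removed.

```javascript
// Sketch of an asciiOnly function as described above (a reconstruction).
function asciiOnly(text) {
  // Array.from iterates by code point, so surrogate pairs stay together.
  return Array.from(text, (ch) =>
    ch.codePointAt(0) > 127
      ? "QQ" + encodeURI(ch).replace(/%/g, "")
      : ch
  ).join("");
}

const title = "this is simple string with finnish word tämä on ääkköstesti älkää ihmetelkö";

console.log(asciiOnly("äl")); // "QQC3A4l"
console.log(new RegExp("\\b" + asciiOnly("äl"), "gi").test(asciiOnly(title))); // true
console.log(new RegExp("\\b" + asciiOnly("mä"), "gi").test(asciiOnly(title))); // false
```

Because the encoded form consists only of ASCII word characters, \b behaves as it would for plain ASCII text.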

edited May 14, 2012 at 20:32

answered May 14, 2012 at 19:58

Heitor Chang

6,0573 gold badges50 silver badges67 bronze badges

3 Comments

Mahn Over a year ago

This is a great idea, and the only thing that worked for me. Instead of QQ you can use a control string of ___ which is a bit safer and still ascii, and instead of encodeURI you can leverage javascript's native escape/unescape methods, but otherwise it does the job.

petriq Over a year ago

this is not good solution for those who want to do something with matched substring

Jānis Elmeris Over a year ago

Does this assume any non-ASCII character to be a word character? For example, "äl" wouldn't be treated as the beginning of a word in "👽älkää", although it should be.

Ed. · Accepted Answer · 2016-03-14 14:30:26Z

1

What you are looking for is the Unicode word boundaries standard:

http://unicode.org/reports/tr29/tr29-9.html#Word_Boundaries

There is a JavaScript implementation here (unicodejs.wordbreak.js)

https://github.com/wikimedia/unicodejs

answered Mar 14, 2016 at 14:30

Ed.

2,0792 gold badges14 silver badges6 bronze badges

3 Comments

Flimm Over a year ago

I don't think Javascript follows the Unicode standard in this regard.

kontur Over a year ago

An interesting resource in this context nonetheless!

loretoparisi Over a year ago

that's pretty cool, but not clear how to use it in this context.

Lionel Rowe · Accepted Answer · 2023-07-17 08:32:45Z

None of the current answers are suitable for general-purpose use, so below is what I'm using.

Important things to note:

We use positive lookaround for the word sides of the boundary, and negative lookaround for the non-word sides. This is different from using positive lookaround with a negated character class, as the latter won't match the start or end of the input string.
w is roughly equivalent to the concept of a "word-like character" (including numbers and diacritics), but you might want to use a different definition depending on your use case or the characteristics of your target languages.
s matches the start of a word, and e matches the end. b matches either and so is the closest in semantics to \b, but usually it's better to just use s or e alone for clarity and performance reasons, because they're mutually exclusive.

// Word character: letter, mark (diacritics), or number.
// Add/remove more characters and character classes as desired,
// e.g. you might want to add _ for greater equivalence with \b
const w = /[\p{L}\p{M}\p{N}]/u.source

// Start of word
const s = new RegExp(`(?:(?<!${w})(?=${w}))`, 'u').source
// End of word
const e = new RegExp(`(?:(?<=${w})(?!${w}))`, 'u').source
// Word break
const b = new RegExp(`(?:${s}|${e})`, 'u').source

console.log('Compiled source:', b)

// Usage:
const regex = new RegExp(`${s}(?:word|单?词)${e}`, 'giu')

const text = `Word more content word-with-hyphen
foreword, wordless, forewordless (not matched)
单词, comma,
end of line: 词
at end: WORD`

console.log('Results:', text.replaceAll(regex, '[[$&]]'))

Depending on your use case, you might also find that using Intl.Segmenter gives the best results. You can even try using a character that isn't present in the input string as a makeshift delimiter, then matching on that:

const DELIM = '⍼'

function addDelims(text) {
    const segments = [...new Intl.Segmenter('en-US', { granularity: 'word' }).segment(text)]
    
    return DELIM + segments.map((s) => s.segment).join(DELIM) + DELIM
}

function stripDelims(text) {
    return text.replaceAll(DELIM, '')
}

const text = `word foreword wordlike WORD 词 单词 word`
const withDelims = addDelims(text)
const replaced = withDelims.replaceAll(new RegExp(`${DELIM}(?:${'word|单?词'})${DELIM}`, 'giu'), '[[$&]]')
const stripped = stripDelims(replaced)

console.log({
    withDelims,
    replaced,
    stripped,
})

Antonín Slejška · Accepted Answer · 2015-06-24 13:07:18Z

I have had a similar problem, but I had to replace an array of terms. None of the solutions I found worked when two terms were next to each other in the text (because their boundaries overlapped). So I had to use a slightly modified approach:

var text = "Ještě. že; \"už\" à. Fürs, 'anlässlich' že že že.";
var terms = ["à","anlässlich","Fürs","už","Ještě", "že"];
var replaced = [];
var order = 0;
for (var i = 0; i < terms.length; i++) {
    terms[i] = "(^\|[ \n\r\t.,;'\"\+!?-])(" + terms[i] + ")([ \n\r\t.,;'\"\+!?-]+\|$)";
}
var re = new RegExp(terms.join("|"), "");
while (true) {
    var replacedString = "";
    text = text.replace(re, function replacer(match){
        var beginning = match.match("^[ \n\r\t.,;'\"\+!?-]+");
        if (beginning == null) beginning = "";
        var ending = match.match("[ \n\r\t.,;'\"\+!?-]+$");
        if (ending == null) ending = "";
        replacedString = match.replace(beginning,"");
        replacedString = replacedString.replace(ending,"");
        replaced.push(replacedString);
        return beginning+"{{"+order+"}}"+ending;
    });
    if (replacedString == "") break;
    order += 1;
}

See the code in a fiddle: http://jsfiddle.net/antoninslejska/bvbLpdos/1/

The regular expression is inspired by: http://breakthebit.org/post/3446894238/word-boundaries-in-javascripts-regular

I can't say, that I find the solution elegant...

Manthos · Accepted Answer · 2020-02-12 12:28:14Z

0

The correct answer to the question is given by andrefs. I will only rewrite it more clearly, after putting all required things together.

For ASCII text, you can use \b for matching a word boundary both at the start and the end of a pattern. When using Unicode text, you need to use 2 different patterns for doing the same:

Use (?<=^|\P{L}) for matching the start or a word boundary before the main pattern.
Use (?=\P{L}|$) for matching the end or a word boundary after the main pattern.
Additionally, use (?i) in the beginning of everything, to make all those matchings case-insensitive.

So the resulting answer is: (?i)(?<=^|\P{L})xxx(?=\P{L}|$), where xxx is your main pattern. This would be the equivalent of (?i)\bxxx\b for ASCII text.

For your code to work, you now need to do the following:

Assign to your variable "searchterm", the pattern or words you want to find.
Escape the variable's contents. For example, replace '\' with '\\' and also do the same for any reserved special character of regex, like '\^', '\$', '\/', etc. Check here for a question on how to do this.
Insert the variable's contents to the pattern above, in the place of "xxx", by simply using the string.replace() method.
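The pattern above is written in PCRE style, so it needs adjusting for JavaScript: (?i) is not accepted there, and \P{L} requires the u flag. A minimal sketch of the three steps for JavaScript, assuming an engine with lookbehind support; the function names escapeRegExp and unicodeWordRegex are made up for illustration.

```javascript
// Step 2: escape regex metacharacters so the term is matched literally.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Step 3: insert the escaped term into the pattern.
// The i flag replaces (?i); the u flag enables \P{L}.
function unicodeWordRegex(searchterm) {
  return new RegExp(
    `(?<=^|\\P{L})${escapeRegExp(searchterm)}(?=\\P{L}|$)`,
    "giu"
  );
}

// Step 1: the text and the term to look for.
const title = "this is simple string with finnish word tämä on ääkköstesti älkää ihmetelkö";

console.log(title.match(unicodeWordRegex("tämä"))); // ["tämä"]
console.log(title.match(unicodeWordRegex("äl")));   // null ("äl" is not a whole word here)
```

Note that this matches whole words only. For the autocomplete case in the question, where only the start of a word matters, the trailing lookahead would be dropped.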

edited Feb 12, 2020 at 12:28

answered Feb 12, 2020 at 12:14

Manthos

497 bronze badges

3 Comments

loretoparisi Over a year ago

Thanks, but I get a SyntaxError: Invalid regular expression: /(?i)(?<=^|P{L})äl(?=P{L}|$)/: Invalid group when using new RegExp(pattern.replace('xxx', searchterm), "g"); for var pattern = '(?i)(?<=^|\P{L})xxx(?=\P{L}|$)'.

loretoparisi Over a year ago

So the error was due to the (?i). If I remove it I get /(?<=^|P{L})äl(?=P{L}|$)/g, but when exec I have no match.

Manthos Over a year ago

@loretoparisi My response described the answer using the PCRE (PHP) flavor for writing regular expressions. You get the error because you are applying it in an environment that uses the ECMAScript flavor. For it to work properly, you need to modify it by removing the 1st term and adding the i modifier, plus the u modifier that \P{L} requires: /(?<=^|\P{L})xxx(?=\P{L}|$)/gmiu

Angry Fox · Accepted Answer · 2022-05-30 08:09:53Z

0

Bad, but working:

var text = " аб аб АБ абвг ";
var ttt = "(аб)"
var p = "(^|$|[^A-Za-zА-Я-а-я0-9()])"; // add other word boundary symbols here
var exp = new RegExp(p+ttt+p,"gi");
// Replace twice: adjacent matches share a boundary character, so a single pass skips every other one.
text = text.replace(exp, "$1($2)$3").replace(exp, "$1($2)$3");
console.log(text);

result (without quotes):

" (аб) (аб) (АБ) абвг "

answered May 30, 2022 at 8:09

Angry Fox

611 silver badge10 bronze badges

Comments

Valentin · Accepted Answer · 2022-09-20 14:57:26Z

I struggled hard with this while working with French accented characters, and I managed to find this solution:

const myString = "MyString";
const regex = new RegExp(
    "(?:[^À-ú]|^)\\b(" + myString + ")\\b(?:[^À-ú]|$)",
    "ig"
);

What it does: it keeps checking word boundaries with \b before and after "MyString". In addition, (?:[^À-ú]|^) and (?:[^À-ú]|$) check that MyString is not surrounded by any accented characters.

It will not work with Cyrillic, but it may be possible to find the range of Cyrillic characters and edit [^À-ú] accordingly.

Warning, it captures only the group (MyString) but the total match contains previous and next characters
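Since the full match includes the neighbouring characters, it is probably safest to read capture group 1 rather than the whole match. A small sketch of that, using String.prototype.matchAll (the input string and the variable names input and hits are made up for illustration):

```javascript
const myString = "MyString";
const regex = new RegExp(
    "(?:[^À-ú]|^)\\b(" + myString + ")\\b(?:[^À-ú]|$)",
    "ig"
);

const input = "Lorem ipsum. MyString dolor (MyString) éMyStringé";

// m[0] would be " MyString " or "(MyString)"; m[1] is just the term.
const hits = Array.from(input.matchAll(regex), (m) => m[1]);

console.log(hits); // ["MyString", "MyString"]
```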

See example : https://regex101.com/r/5P0ZIe/1

Match examples :

MyString
- match : "MyString"
- group 1 : "MyString"
Lorem ipsum. MyString dolor sit amet
- match : " MyString "
- group 1 : "MyString"
(MyString)
- match : "(MyString)"
- group 1 : "MyString"
BetweenCharactersMyStringIsNotFound
- match : Nothing
- group 1 : Nothing
éMyStringé
- match : Nothing
- group 1 : Nothing
ùMyString
- match : Nothing
- group 1 : Nothing
MyStringÖ
- match : Nothing
- group 1 : Nothing
