JavaScript regular expression for searching word boundaries in Unicode string

Question

Is there a way to find word boundaries in a Japanese string (e.g. "私はマーケットに行きました。") with JavaScript regular expressions? The "xregexp" JS library can be used.

E.g.:

var xr = RegExp("\\bst","g");
xr.test("The string") // --> true

I need the same logic for Japanese strings.

A way to match the boundaries between Han, Hiragana, and Katakana would assist but not solve this problem on its own. So far I can't even find a way to match those, even with xregexp. You may be interested in a question I just asked about that: stackoverflow.com/questions/16492933/…
– hippietrail, Commented May 11, 2013 at 1:47
For Japanese it would be better to use a full morphological analyzer. Here's one in JavaScript: github.com/takuyaa/kuromoji.js
– katspaugh, Commented Oct 1, 2015 at 8:56

Peter O. · Accepted Answer · 2011-10-28 11:19:03Z

6

The actual problem of separating a Japanese sentence into words is more complicated than it appears, since words are not separated by spaces as they are, for example, in English.

For example, the sentence 私はマーケットに行きました。 ("I went to the market") has the following words:

私 - watakushi
は - wa
マーケット - maaketto
に - ni
行きました - ikimashita
。 - (period)

A reliable parser of Japanese sentences would, among other things, have to find where the particles (wa and ni) lie in the sentence, in order to find the remaining words.
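To illustrate why particles and inflected verbs need real parsing, here is a minimal sketch that splits the sample sentence into runs of a single script (Han, Hiragana, Katakana). The helper name `splitByScript` and the simplified code point ranges are assumptions made for this example, not part of any library.

```javascript
// Simplified ranges (assumptions): common Han ideographs, Hiragana letters,
// and Katakana letters plus the prolonged sound mark "ー" (U+30FC).
const SCRIPT_RUNS = /[\u4E00-\u9FFF]+|[\u3041-\u3096]+|[\u30A1-\u30FA\u30FC]+/g;

// Hypothetical helper: returns each maximal run of characters from one script.
function splitByScript(text) {
  return text.match(SCRIPT_RUNS) || [];
}

const runs = splitByScript("私はマーケットに行きました。");
console.log(runs);
// → [ '私', 'は', 'マーケット', 'に', '行', 'きました' ]
```

The script changes happen to isolate 私, は, マーケット and に here, but 行きました is cut into 行 and きました, and a particle that follows other hiragana would not be separated at all. Script boundaries are a hint, not a word segmentation.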

answered Oct 28, 2011 at 11:19


1 Comment

bobince Over a year ago

Yes, this is really hard; you have to have big dictionaries of words, and heuristics for guessing which words are more likely to be meant when a sequence of characters (especially kana) is used. It's possible to make puns where you could read a sentence in more than one way, so ultimately the task is not completely solvable, and there's very little you can do with tools as blunt as regex (never mind JavaScript's Unicode-ignorant regexps).

katspaugh · Accepted Answer · 2011-10-28 10:23:54Z

4

\b, as well as \w and \W, isn't Unicode-aware in JavaScript. You have to define your word boundaries yourself as a specific character set, for example (^|$|[\s.,:\u3002]+) or similar.

\u3002 is from ('。'.charCodeAt(0)).toString(16). Is it a punctuation symbol in Japanese?
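A small sketch of both points: `\b` does not fire inside Japanese text, while an explicit delimiter set does split at punctuation. The delimiter class is the one from this answer; the two-sentence sample string is an assumption made for illustration.

```javascript
// \b is defined in terms of \w, which only covers [A-Za-z0-9_] in JavaScript.
const asciiHit = /\bst/.test("The string");          // true: "st" starts a word
const japaneseHit = /\b行/.test("私は行きました。");   // false: no \w character nearby

// Code point of the ideographic full stop, computed the same way as in the answer.
const fullStopHex = "。".charCodeAt(0).toString(16);  // "3002"

// Explicit delimiter set: whitespace, ".", ",", ":" and "。".
const delimiters = /[\s.,:\u3002]+/;

// split() leaves an empty string after the final "。", so drop empty entries.
const chunks = "これはペンです。あれは本です。".split(delimiters).filter(Boolean);
console.log(asciiHit, japaneseHit, fullStopHex, chunks);
// → true false 3002 [ 'これはペンです', 'あれは本です' ]
```

Note that this only yields sentence-sized chunks; nothing inside a sentence is separated.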

Or, a contrario, define a Unicode range of word-constructing letters and negate it:

var boundaries = /(^|$|\s+|[^\u30A0-\u30FA]+)/g;

The example katakana range is taken from http://www.unicode.org/charts/PDF/U30A0.pdf.
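A hedged sketch of the negated-range idea, simplified to a plain negated class with an ASCII hyphen between the range endpoints. One caveat worth checking: the prolonged sound mark "ー" is U+30FC, which lies outside U+30A0 to U+30FA, so it is treated as a boundary. Widening the range to the end of the Katakana block (U+30FF) is an assumption of this example, not something the answer states.

```javascript
const sentence = "私はマーケットに行きました。";

// Range as given in the answer: "ー" (U+30FC) is not included.
const narrow = /[^\u30A0-\u30FA]+/;
const narrowWords = sentence.split(narrow).filter(Boolean);

// Assumed wider range covering the whole Katakana block, U+30A0 to U+30FF.
const wide = /[^\u30A0-\u30FF]+/;
const wideWords = sentence.split(wide).filter(Boolean);

console.log(narrowWords); // → [ 'マ', 'ケット' ]
console.log(wideWords);   // → [ 'マーケット' ]
```

Either way, only the katakana word is recovered; the kanji and hiragana parts of the sentence are discarded as "boundary" text.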

edited Oct 28, 2011 at 10:23

answered Oct 28, 2011 at 10:08


2 Comments

Andrei Over a year ago

I think so. '。' is a punctuation symbol.

bobince Over a year ago

Yes, it is a full stop, one of the few reliable ways of splitting at word (sentence) boundaries. Doing better than that is very hard (as per Peter's answer).
