4

Is there a way to find word boundaries in a Japanese string (e.g. "私はマーケットに行きました。") using JavaScript regular expressions (the "xregexp" JS library can be used)?

E.g.:

var xr = RegExp("\\bst","g");
xr.test("The string") // --> true

I need the same logic for Japanese strings.

3
  • I don't understand, what is \\bst? Commented May 11, 2013 at 1:43
  • A way to match the boundaries between Han, Hiragana, and Katakana would assist but not solve this problem on its own. So far I can't even find a way to match those, even with xregexp. You may be interested in a question I just asked about that: stackoverflow.com/questions/16492933/… Commented May 11, 2013 at 1:47
  • For Japanese it would be better to use a full morphological analyzer. Here's one in JavaScript: github.com/takuyaa/kuromoji.js Commented Oct 1, 2015 at 8:56

2 Answers

6

However, the actual problem of separating a Japanese sentence into words is more complicated than it appears, since words are not separated by spaces as they are, for example, in English.

For example, the sentence 私はマーケットに行きました。 ("I went to the market") has the following words:

  • 私 - watakushi
  • は - wa
  • マーケット - maaketto
  • に - ni
  • 行きました - ikimashita
  • 。 - (period)

A reliable parser of Japanese sentences would, among other things, have to find where the particles (wa and ni) lie in the sentence, in order to find the remaining words.
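To illustrate why regular expressions alone fall short here, the following is a minimal sketch that splits a string wherever the script changes (Han, Hiragana, Katakana, other). The function name `scriptRuns` and the character ranges are assumptions chosen for illustration, not a complete description of Japanese text. On the example sentence it happens to isolate the particles, but it also cuts the verb 行きました in two, which is the kind of error a dictionary-based analyzer is needed to avoid.

```javascript
// Assumed ranges (illustrative, not exhaustive):
//   Han:      U+4E00-U+9FFF (CJK Unified Ideographs)
//   Hiragana: U+3041-U+3096
//   Katakana: U+30A1-U+30FA plus the long-vowel mark ー (U+30FC)
const SCRIPT_RUN = new RegExp(
  "[\\u4E00-\\u9FFF]+" +          // a run of Han characters
  "|[\\u3041-\\u3096]+" +         // a run of Hiragana
  "|[\\u30A1-\\u30FA\\u30FC]+" +  // a run of Katakana
  "|[^\\u4E00-\\u9FFF\\u3041-\\u3096\\u30A1-\\u30FA\\u30FC]+", // anything else
  "g"
);

// Hypothetical helper: returns the maximal same-script runs of `text`.
function scriptRuns(text) {
  return text.match(SCRIPT_RUN) || [];
}

const runs = scriptRuns("私はマーケットに行きました。");
// runs: ["私", "は", "マーケット", "に", "行", "きました", "。"]
// Note that 行きました is split into 行 + きました, which is not a word boundary.
```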


1 Comment

Yes, this is really hard; you have to have big dictionaries of words, and heuristics for guessing which words are more likely to be meant when a given sequence of characters (especially kana) is used. It's possible to make puns where you could read a sentence in more than one way, so ultimately the task is not completely solvable, and there's very little you can do with tools as blunt as regex (never mind JavaScript's Unicode-ignorant regexps).
4

\b, as well as \w and \W, isn't Unicode-aware in JavaScript. You have to define your word boundaries as a specific character set. Like (^|$|[\s.,:\u3002]+) or similar.

\u3002 is from ('。'.charCodeAt(0)).toString(16). Is it a punctuation symbol in Japanese?
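As a rough sketch of the delimiter-set idea, the snippet below first shows that \b never fires inside kana or kanji text, then splits on the explicit character class from above. The helper name `splitOnDelimiters` is hypothetical, and the result is sentence-like chunks rather than words.

```javascript
// Delimiter set taken from the answer: whitespace, ".", ",", ":" and 。 (U+3002).
const DELIMITERS = /[\s.,:\u3002]+/;

// Hypothetical helper: split on the explicit delimiter set.
function splitOnDelimiters(text) {
  // filter(Boolean) drops the empty string left behind by a trailing delimiter
  return text.split(DELIMITERS).filter(Boolean);
}

// \b is defined in terms of \w, which only covers [A-Za-z0-9_],
// so there is no \b position anywhere inside purely Japanese text.
const asciiBoundary = /\bst/.test("The string");     // true
const kanaBoundary = /\bに/.test("マーケットに行き"); // false

const chunks = splitOnDelimiters("私はマーケットに行きました。彼は来ました。");
// chunks: ["私はマーケットに行きました", "彼は来ました"]
```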

Or, a contrario, define a Unicode range of word-constructing letters and negate it:

var boundaries = /(^|$|\s+|[^\u30A0-\u30FF]+)/g;

The example katakana range taken from http://www.unicode.org/charts/PDF/U30A0.pdf.
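A small sketch of the negated-range approach, assuming the full Katakana block U+30A0 to U+30FF and an ASCII hyphen inside the character class (an en dash there would be treated as a literal character, not a range). The block has to reach past U+30FA so that the long-vowel mark ー (U+30FC) in マーケット is included. The helper name `katakanaRuns` is hypothetical.

```javascript
// Everything outside the Katakana block counts as a boundary.
const NON_KATAKANA = /[^\u30A0-\u30FF]+/;

// Hypothetical helper: return the katakana runs found in `text`.
function katakanaRuns(text) {
  // Splitting on the boundaries leaves only the katakana runs;
  // filter(Boolean) removes empty strings at the edges.
  return text.split(NON_KATAKANA).filter(Boolean);
}

const katakanaWords = katakanaRuns("私はマーケットに行きました。");
// katakanaWords: ["マーケット"]
```

This only recovers katakana loanwords; the kanji and hiragana parts of the sentence are treated as boundary material and are not segmented at all.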

2 Comments

I think yes. '。' is a punctuation symbol
Yes, it is a full stop, one of the few reliable ways of splitting at word (sentence) boundaries. Doing better than that is very hard (as per Peter's answer).
