8

What would be the best regular expression for tokenizing an English text?

By an English token, I mean an atom consisting of maximum number of characters that can be meaningfully used for NLP purposes. An analogy is a "token" in any programming language (e.g. in C, '{', '[', 'hello', '&', etc. can be tokens). There is one restriction: Though English punctuation characters can be "meaningful", let's ignore them for the sake of simplicity when they do not appear in the middle of \w+. So, "Hello, world." yields 'hello' and 'world'; similarly, "You are good-looking." may yield either [you, are, good-looking] or [you, are, good, looking].

15
  • See this question about tokenizing in C++ using Boost.Regex. Commented Sep 13, 2010 at 20:00
  • 1
    possible duplicate of True definition of an English word? Commented Sep 13, 2010 at 20:08
  • @OTZ in English what is a token if not a word? Commented Sep 13, 2010 at 20:13
  • 2
    @OTZ: C has a formal specification. English has no such specification. You have to provide the specification of what you want. We can't guess what you are thinking. Commented Sep 13, 2010 at 20:19
  • 3
    You need to be more specific about what you want to consider a token. Should spaces be tokens? Punctuation marks? There are limitations to what you can do with a regular expression (e.g., distinguishing between ' used as an apostrophe versus a single quotation mark). Commented Sep 13, 2010 at 20:35

4 Answers

5

Treebank Tokenization

Penn Treebank (PTB) tokenization is a reasonably common tokenization scheme used for natural language processing (NLP) work.

You can find a sed script with the appropriate regular expressions to get this tokenization here.
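To give a rough idea of the kind of rules such a script applies, here is a minimal sketch in Python. It is not the sed script itself, only a tiny hand-picked subset in the same spirit (split off punctuation, split contractions such as "won't" into "wo n't"). The function name `ptb_like_tokenize` and the exact patterns are my own assumptions for illustration.

```python
import re

def ptb_like_tokenize(text):
    """Tiny, illustrative subset of Treebank-style rules (not the official script)."""
    # Put spaces around common punctuation marks so they become separate tokens.
    text = re.sub(r"([,;:?!])", r" \1 ", text)
    # Split off a period only at the very end of the input.
    text = re.sub(r"\.\s*$", " .", text)
    # Split negative contractions: "won't" -> "wo n't", "don't" -> "do n't".
    text = re.sub(r"(\w)(n't)\b", r"\1 \2", text, flags=re.IGNORECASE)
    # Split other clitics: "it's" -> "it 's", "they'll" -> "they 'll".
    text = re.sub(r"(\w)('(?:s|re|ve|ll|d|m))\b", r"\1 \2", text, flags=re.IGNORECASE)
    return text.split()

tokens = ptb_like_tokenize("I won't go, it's late.")
# tokens == ['I', 'wo', "n't", 'go', ',', 'it', "'s", 'late', '.']
```

Note that this style keeps punctuation as tokens; if you want to ignore punctuation as in the question, filter those tokens out afterwards.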

Software Packages

However, most NLP packages provide ready-to-use tokenizers, so you don't really need to write your own. For example, if you're using Python you can just use the TreebankWordTokenizer provided with NLTK. If you're using the Java-based Stanford Parser, it will by default tokenize any sentence you give it using its edu.stanford.nlp.process.PTBTokenizer.


2 Comments

Thanks for giving us a pointer to the PTB tokenization method. While they don't enumerate what those "subtleties" are on hyphens vs dashes, and I'm not sure if "won't --> wo n't" or "gonna --> gon na" is appropriate, it can be a starter. +1
This link seems to be broken now.
2

You probably shouldn't try to use a regular expression for tokenizing English text. In English some tokens have several different meanings and you can only know which is right by understanding the context in which they are found, and that requires understanding the meaning of the text to some extent. Examples:

  • The character ' could be an apostrophe or it could be used as a single-quote to quote some text.
  • The period could be the end of a sentence or it could signify an abbreviation. Or in some cases it could fulfil both roles simultaneously.
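Both ambiguities can be seen with a small Python sketch. The pattern below is a hypothetical one, chosen only to show the problem: it keeps a trailing apostrophe or period on a word, which is right for a possessive or an abbreviation but wrong for a closing quote or a sentence end, and the regex has no way to tell the cases apart.

```python
import re

# Hypothetical pattern: a word (with inner apostrophes), plus an optional
# trailing apostrophe or period.
KEEP_TRAILING = re.compile(r"\w+(?:'\w+)*['.]?")

def naive_tokens(text):
    return KEEP_TRAILING.findall(text)

quotes = naive_tokens("The dogs' bowls are 'empty'.")
# quotes == ['The', "dogs'", 'bowls', 'are', "empty'"]
# "dogs'" is a correct possessive, but "empty'" swallowed a closing quote.

periods = naive_tokens("Dr. Smith left.")
# periods == ['Dr.', 'Smith', 'left.']
# "Dr." is a correct abbreviation, but "left." swallowed the sentence end.
```

Dropping the trailing character instead just moves the error to the other case.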

Try a natural language parser instead. For example you could use the Stanford Parser. It is free to use and will do a much better job than any regular expression at tokenizing English text. That's just one example though - there are also many other NLP libraries you could use.

4 Comments

tokenizing != parsing. He's talking about lexing (unless I miss my guess).
@Nathan you got that right. Byers is referring to a tagger, which is not my focus.
@Paul Nathan: You can't accurately tokenize English text using a regular expression. If you only want it to work some of the time and don't care about errors then you can probably get away with using a simple regular expression. If you want it to work most of the time then you need something more powerful. You could keep extending the regex to cover more and more special cases, but seeing as the more powerful solutions already exist and are free, why not just use them from the start?
Pain of integration, for one thing. :-) OP hasn't discussed his target corpus. If it's a basic analysis, a regex will work. If it's for a more precise problem, of course you want a more developed system. At a guess, OP wants a basic hack, since an expert would frame the question much more precisely. Also Perl regexes are not true regexes, they are context-sensitive somethings.
1

You can split on [^\p{L}]+. This splits the text on every run of characters that are not letters.
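As a sketch of the same idea in Python: the standard `re` module does not support `\p{L}`, so the example below assumes the class `[\W\d_]` (non-word characters, digits, underscore) as a close approximation of "not a letter". The function name `letter_tokens` is made up for illustration.

```python
import re

# Approximation of [^\p{L}]+ for Python's stdlib `re`, which lacks \p{L}:
# anything that is a non-word character, a decimal digit, or an underscore.
NON_LETTERS = re.compile(r"[\W\d_]+")

def letter_tokens(text):
    # re.split leaves empty strings at the edges, so filter them out.
    return [t for t in NON_LETTERS.split(text) if t]

a = letter_tokens("Hello, world.")          # ['Hello', 'world']
b = letter_tokens("You are good-looking.")  # ['You', 'are', 'good', 'looking']
c = letter_tokens("caf\u00e9 no. 42")       # ['caf\u00e9', 'no']
```

Since the hyphen is not a letter, "good-looking" comes out as two tokens with this approach.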



1 Comment

What's that \p doing? Which language's regexp library r u using?
0

There are some complexities.

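A word consists of characters from [A-Za-z0-9\-]. But you may have other delimiters around the word besides whitespace! A token can start after [(\s] and end before [),.\-\s?:;!] (note the escaped hyphen: an unescaped .-\s is read as a character range, which some engines reject).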
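A minimal Python sketch along these lines, assuming ASCII-only text and assuming a hyphen counts as part of a word only when it sits between two alphanumeric runs (so surrounding brackets, dashes and punctuation are simply skipped). The name `ascii_word_tokens` is hypothetical.

```python
import re

# Alphanumeric runs, optionally joined by single inner hyphens.
WORD = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*")

def ascii_word_tokens(text):
    # Lowercase first, matching the output format used in the question.
    return WORD.findall(text.lower())

a = ascii_word_tokens("Hello, world.")          # ['hello', 'world']
b = ascii_word_tokens("You are good-looking.")  # ['you', 'are', 'good-looking']
c = ascii_word_tokens("(well-known) -- yes!")   # ['well-known', 'yes']
```

Matching the words directly avoids having to enumerate every possible leading and trailing delimiter.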

3 Comments

Noooo. Don't do this. Use \b instead. It matches a word boundary. So this would match a word: \b.+?\b
\b won't work properly if the word contains non-ASCII characters!
@Rohan: That won't work for hyphenated words or apostrophe'd words. Also, this is not a full Perl regex. This is a sample regex meant to demonstrate in a non-Perl syntax a subset of possibility.
