Regexp for Tokenizing English Text

Question

What would be the best regular expression for tokenizing an English text?

By an English token, I mean an atom consisting of the maximum number of characters that can be meaningfully used for NLP purposes. An analogy is a "token" in a programming language (e.g. in C, '{', '[', 'hello', '&', etc. can be tokens). There is one restriction: although English punctuation characters can be "meaningful", let's ignore them for the sake of simplicity unless they appear in the middle of \w+. So "Hello, world." yields 'hello' and 'world'; similarly, "You are good-looking." may yield either [you, are, good-looking] or [you, are, good, looking].

@OTZ: C has a formal specification. English has no such specification. You have to provide the specification of what you want. We can't guess what you are thinking.
– Mark Byers, Commented Sep 13, 2010 at 20:19
You need to be more specific about what you want to consider a token. Should spaces be tokens? Punctuation marks? There are limitations to what you can do with a regular expression (e.g., distinguishing between ' used as an apostrophe versus a single quotation mark).
– Adrian McCarthy, Commented Sep 13, 2010 at 20:35

dmcer · Accepted Answer · 2010-09-14 00:18:05Z

5

Treebank Tokenization

Penn Treebank (PTB) tokenization is a reasonably common tokenization scheme used for natural language processing (NLP) work.

You can find a sed script with the appropriate regular expressions to get this tokenization here.
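To give a feel for what such a script does, here is a minimal Python sketch of a few Treebank-style rewrite rules. It is an illustrative subset written for this example, not the actual PTB sed script; the rule list and the name ptb_like_tokenize are assumptions.

```python
import re

# A small, hypothetical subset of Treebank-style rules, applied in order.
# The real PTB script has many more rules (quotes, dashes, ellipses, ...).
_RULES = [
    (re.compile(r'([,;:?!()"])'), r" \1 "),                      # set off punctuation
    (re.compile(r"\.(\s*)$"), r" .\1"),                          # split the final period
    (re.compile(r"\b(can)(not)\b", re.I), r"\1 \2"),             # cannot -> can not
    (re.compile(r"([a-z])(n't)\b", re.I), r"\1 \2"),             # won't -> wo n't
    (re.compile(r"([a-z])('(?:s|m|d|ll|re|ve))\b", re.I), r"\1 \2"),  # it's -> it 's
]


def ptb_like_tokenize(sentence: str) -> list[str]:
    """Apply each rewrite rule in turn, then split on whitespace."""
    for pattern, replacement in _RULES:
        sentence = pattern.sub(replacement, sentence)
    return sentence.split()


print(ptb_like_tokenize("You won't believe it's good-looking."))
# ['You', 'wo', "n't", 'believe', 'it', "'s", 'good-looking', '.']
```

Note that this style keeps punctuation as separate tokens and splits contractions, so a filtering step would still be needed to match the punctuation-free output asked for in the question.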

Software Packages

However, most NLP packages provide ready-to-use tokenizers, so you don't really need to write your own. For example, if you're using Python you can just use the TreebankWordTokenizer provided with NLTK. If you're using the Java-based Stanford Parser, it will by default tokenize any sentence you give it using its edu.stanford.nlp.process.PTBTokenizer.

answered Sep 14, 2010 at 0:18

dmcer


2 Comments

OTZ Over a year ago

Thanks for giving us a pointer to the PTB tokenization method. While they don't enumerate what those "subtleties" are on hyphens vs dashes, and I'm not sure if "won't --> wo n't" or "gonna --> gon na" is appropriate, it can be a starter. +1

Anderson Green Over a year ago

This link seems to be broken now.

Mark Byers · Accepted Answer · 2010-09-13 22:30:23Z

2

You probably shouldn't try to use a regular expression for tokenizing English text. In English some tokens have several different meanings and you can only know which is right by understanding the context in which they are found, and that requires understanding the meaning of the text to some extent. Examples:

The character ' could be an apostrophe or it could be used as a single-quote to quote some text.
The period could be the end of a sentence or it could signify an abbreviation. Or in some cases it could fulfil both roles simultaneously.
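The two ambiguities above can be made concrete with a small sketch. The patterns APOSTROPHE_WORD and PERIOD_WORD are hypothetical, written only to show how a context-free pattern treats both readings of a character the same way.

```python
import re

# Hypothetical pattern: letters, allowing internal or trailing apostrophes
# so that possessives such as "dogs'" stay in one piece.
APOSTROPHE_WORD = re.compile(r"[A-Za-z]+(?:'[A-Za-z]*)*")

# Hypothetical pattern: letters with an optional trailing period,
# so that abbreviations such as "Dr." stay in one piece.
PERIOD_WORD = re.compile(r"[A-Za-z]+\.?")

# The possessive apostrophe is kept, as intended ...
print(APOSTROPHE_WORD.findall("the dogs' bowls"))
# ['the', "dogs'", 'bowls']

# ... but a closing single quote is glued onto the word in the same way.
print(APOSTROPHE_WORD.findall("he said 'dogs' twice"))
# ['he', 'said', "dogs'", 'twice']

# The abbreviation period and the sentence-final period look identical.
print(PERIOD_WORD.findall("Dr. Smith left."))
# ['Dr.', 'Smith', 'left.']
```

Dropping the trailing apostrophe or period from the patterns just moves the error to the other case, which is the point being made about needing context.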

Try a natural language parser instead. For example you could use the Stanford Parser. It is free to use and will do a much better job than any regular expression at tokenizing English text. That's just one example though - there are also many other NLP libraries you could use.

edited Sep 13, 2010 at 22:30

answered Sep 13, 2010 at 20:00

Mark Byers


4 Comments

Paul Nathan Over a year ago

tokenizing != parsing. He's talking about lexing (unless I miss my guess).

OTZ Over a year ago

@Nathan you got that right. Byers is referring to a tagger, which is not my focus.

Mark Byers Over a year ago

@Paul Nathan: You can't accurately tokenize English text using a regular expression. If you only want it to work some of the time and don't care about errors then you can probably get away with using a simple regular expression. If you want it to work most of the time then you need something more powerful. You could keep extending the regex to cover more and more special cases, but seeing as the more powerful solutions already exist and are free, why not just use them from the start?

Paul Nathan Over a year ago

Pain of integration, for one thing. :-) OP hasn't discussed his target corpus. If it's a basic analysis, a regex will work. If it's for a more precise problem, of course you want a more developed system. At a guess, OP wants a basic hack, since an expert would frame the question much more precisely. Also Perl regexes are not true regexes, they are context-sensitive somethings.

Colin Hebert · Accepted Answer · 2010-09-13 20:01:17Z

1

You can split on [^\p{L}]+. This splits the text at every run of characters that are not letters, so only the letter sequences remain as tokens.

Resources:

regular-expressions.info - unicode
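\p{L} is a Unicode property class meaning "any letter"; it is supported by engines such as Java and Perl, but not by Python's built-in re module. As a rough sketch in Python, the class [^\W\d_] (word characters minus digits and underscore) is a close approximation, though not an exact equivalent. The function name letter_tokens is an assumption made for this example.

```python
import re

# Approximation of \p{L}+ using only the standard library:
# \w minus digits and underscore. A few non-letter alphanumerics
# (e.g. superscript digits) can still slip through.
LETTERS = re.compile(r"[^\W\d_]+")


def letter_tokens(text: str) -> list[str]:
    """Return the runs of letters, i.e. what splitting on non-letters leaves."""
    return LETTERS.findall(text)


print(letter_tokens("Hello, wörld! 42 times"))
# ['Hello', 'wörld', 'times']
```

Using findall on the letter class rather than split on its complement avoids empty strings at the start or end of the result. Hyphenated words such as "good-looking" come out as two tokens with this approach.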

answered Sep 13, 2010 at 20:01

Colin Hebert


1 Comment

OTZ Over a year ago

What's that \p doing? Which language's regexp library r u using?

Paul Nathan · Accepted Answer · 2010-09-13 20:02:54Z

0

There are some complexities.

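A word will consist of characters from [A-Za-z0-9\-]. But you may have other delimiters around the word as well: it can start after [(\s] and end before [),.\s?:;!-] (with the hyphen placed last so that it is read as a literal character rather than as a range).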
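One possible way to turn this into something runnable is sketched below. It assumes the behaviour described in the question (lowercased tokens, punctuation dropped unless it sits inside a word) and allows a hyphen or apostrophe only between alphanumeric runs; the pattern and the name tokenize are assumptions, not a complete solution.

```python
import re

# Alphanumeric runs, optionally joined by a single internal hyphen or
# apostrophe. Leading and trailing punctuation is never part of a match.
TOKEN = re.compile(r"[A-Za-z0-9]+(?:[-'][A-Za-z0-9]+)*")


def tokenize(text: str) -> list[str]:
    """Lowercase the text and return the word-like tokens."""
    return TOKEN.findall(text.lower())


print(tokenize("Hello, world."))
# ['hello', 'world']
print(tokenize("You are good-looking."))
# ['you', 'are', 'good-looking']
print(tokenize("(Don't stop!)"))
# ["don't", 'stop']
```

Because the pattern matches the words directly, the surrounding delimiters do not need to be listed at all; anything that is not part of a token is simply skipped. It is ASCII-only, so accented letters would split words.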

answered Sep 13, 2010 at 20:02

Paul Nathan


3 Comments

Rohan Singh Over a year ago

Noooo. Don't do this. Use \b instead. It matches a word boundary. So this would match a word: \b.+?\b

Daniel Vandersluis Over a year ago

\b won't work properly if the word contains non-ASCII characters!

Paul Nathan Over a year ago

@Rohan: That won't work for hyphenated words or apostrophe'd words. Also, this is not a full Perl regex. This is a sample regex meant to demonstrate in a non-Perl syntax a subset of possibility.
