
I have a regex that needs to match a large set of characters. The code runs without problems in Ruby 1.8.7, but in 1.9 it kills over. I guess it has to do with encoding. I've done a good number of Google searches, so maybe someone can enlighten me.

Code:

# encoding: utf-8
non_latin_hashtag_chars = [
  (0xA960..0xA97F).to_a, # Hangul Jamo Extended-A
  (0xAC00..0xD7AF).to_a, # Hangul Syllables
  (0xD7B0..0xD7FF).to_a  # Hangul Jamo Extended-B
].flatten.pack('U*').freeze

e = /[a-z_#{non_latin_hashtag_chars}]/io

Error:

~/Desktop: ruby regex_test.rb 
regex_test.rb:13:in `<main>': too many multibyte code ranges are specified: /[a-z_가각갂갃간갅갆갇갈갉갊갋갌갍갎갏감갑값갓갔강갖갗갘같갚갛개객갞갟갠갡갢갣갤갥갦갧갨갩갪갫갬갭갮갯갰갱갲갳갴갵갶갷갸갹갺갻갼갽갾갿걀걁걂걃걄걅걆걇걈걉걊걋걌걍......
  • Do you mean "it keels over", not "it kills over"? Commented Jul 4, 2011 at 22:45

2 Answers


As twehad points out, there is a limit of 10,000 multibyte code ranges in a regexp.

In any case, you should use Unicode ranges within the Regexp:

/[a-z_\uA960-\uA97F\uAC00-\uD7AF\uD7B0-\uD7FF]/io
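As a quick check of the range form, here is a minimal sketch. The constant names `HASHTAG_CHAR` and `SAMPLES` and the sample strings are made up for illustration, and it assumes Ruby 1.9 or later, where `\u` escapes are allowed in regexp and string literals. The three source ranges expand to 11,296 individual characters, whereas the class below holds only three multibyte ranges:

```ruby
# encoding: utf-8
# Three ranges instead of thousands of literal characters in the class.
HASHTAG_CHAR = /[a-z_\uA960-\uA97F\uAC00-\uD7AF\uD7B0-\uD7FF]/i

# Hypothetical samples: Latin letters, underscore, Hangul, punctuation, digit.
SAMPLES = ["a", "Z", "_", "\uAC00", "\uD7A3", "\uA960", "!", "1"]

SAMPLES.each do |s|
  puts "#{s.inspect}: #{HASHTAG_CHAR.match(s) ? 'match' : 'no match'}"
end
```

Only "!" and "1" should report no match.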

I'm not an expert in Korean, so I don't know whether it is equivalent, but if you want to match all Hangul characters, you should use the Unicode script property for that instead:

/[a-z_\p{Hangul}]/io
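On the question of equivalence, a small sketch suggests the two forms are not identical: `\p{Hangul}` also covers Hangul characters outside the three blocks listed in the question, such as the basic Hangul Jamo block starting at U+1100. The constant names below are illustrative only, and the sketch assumes a UTF-8 source file:

```ruby
# encoding: utf-8
EXPLICIT_RANGES = /[\uA960-\uA97F\uAC00-\uD7AF\uD7B0-\uD7FF]/
HANGUL_SCRIPT   = /\p{Hangul}/

# U+AC00 is a Hangul syllable; U+1100 sits in the basic Hangul Jamo block,
# which the explicit ranges above do not list.
["\uAC00", "\u1100"].each do |ch|
  printf("U+%04X explicit=%s script=%s\n",
         ch.ord, !!(ch =~ EXPLICIT_RANGES), !!(ch =~ HANGUL_SCRIPT))
end
```

So if hashtags should accept only the three blocks from the question, the explicit ranges are the stricter choice; if any Hangul character is acceptable, the script property is shorter.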

1 Comment

I think this is the most practical solution. Breaking the individual ranges into separate regex checks joined with || would be my second choice. Ordering the tests from most likely to least likely to be encountered might buy some speed.
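A rough sketch of that second choice, with separate checks joined by `||`, might look like the following. The helper name `hashtag_char?`, the constant names, and the ordering (Latin first, then syllables, then the extended Jamo blocks) are assumptions for illustration, not measured results:

```ruby
# encoding: utf-8
LATIN_RE     = /[a-z_]/i
SYLLABLES_RE = /[\uAC00-\uD7AF]/
JAMO_EXT_RE  = /[\uA960-\uA97F\uD7B0-\uD7FF]/

# Hypothetical helper: true if str contains at least one allowed character.
# || short-circuits, so later patterns run only when earlier ones fail.
def hashtag_char?(str)
  !!(str =~ LATIN_RE || str =~ SYLLABLES_RE || str =~ JAMO_EXT_RE)
end

puts hashtag_char?("\uAC00")
puts hashtag_char?("!")
```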

This is the limit of 10,000 multibyte character ranges in a regex.

You need to change the ONIG_MAX_MULTI_BYTE_RANGES_NUM config parameter (/ruby-1.9.2-p*/include/ruby/oniguruma.h):

#define ONIG_MAX_MULTI_BYTE_RANGES_NUM     10000

and then recompile Ruby.
