
I'm working on a project with Twitter data, and one step is to strip all emoticons from a tweet so they don't trip up the parser. I took a look at Carnegie Mellon's Ark Tweet NLP, which is pretty amazing and has a really nice Java regex pattern for detecting emoticons!

However, I'm not very familiar with Java's regex syntax beyond the basics.

https://github.com/brendano/ark-tweet-nlp/blob/master/src/cmu/arktweetnlp/Twokenize.java

The code I need to convert to Scala looks like this:

public static String emoticon = OR(
        // Standard version  :) :( :] :D :P
        "(?:>|&gt;)?" + OR(normalEyes, wink) + OR(noseArea,"[Oo]") + 
            OR(tongue+"(?=\\W|$|RT|rt|Rt)", otherMouths+"(?=\\W|$|RT|rt|Rt)", sadMouths, happyMouths),


        // reversed version (: D:  use positive lookbehind to remove "(word):"
        // because eyes on the right side is more ambiguous with the standard usage of : ;
        "(?<=(?: |^))" + OR(sadMouths,happyMouths,otherMouths) + noseArea + OR(normalEyes, wink) + "(?:<|&lt;)?",


        //inspired by http://en.wikipedia.org/wiki/User:Scapler/emoticons#East_Asian_style
        eastEmote.replaceFirst("2", "1"), basicface
        // iOS 'emoji' characters (some smileys, some symbols) [\ue001-\uebbb]  
        // TODO should try a big precompiled lexicon from Wikipedia, Dan Ramage told me (BTO) he does this
);

The OR operator is a bit confusing.

So can anyone tell me how to do the conversion? Also, after the conversion, is all I need to do to split each tweet into words and check word.contains(emoticon)? Thank you!


It seems the above question was rather idiotic. However, there's one last part of the task I'm unsure about:

I'm taking those emoticons out of my sentences. Will it work if I just split each sentence on spaces into words and do for (word <- words if !word.contains(regexpattern))?

  • There's no OR operator. It's a static method in the Twokenize class. Commented Jul 3, 2014 at 13:38
  • Change public static String emoticon to val emoticon: String and this could be Scala code. Scala uses the same regex engine as Java and could use the arktweetnlp library as well. Commented Jul 3, 2014 at 13:45

1 Answer

You can use this function:

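// Wrap the whole alternation in one non-capturing group so the result can be concatenated with other patterns.
def OR(patterns: String*): String = patterns.map(p => s"(?:$p)").mkString("(?:", "|", ")")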
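
As a rough sketch of how such a helper might be used to strip emoticons, the snippet below builds a small pattern and removes every match with replaceAllIn. The sub-patterns (eyes, wink, nose, mouth) and the names EmoticonSketch, stripEmoticons and hasEmoticon are simplified, hypothetical stand-ins rather than the real Twokenize definitions, and this OR wraps the whole alternation in one non-capturing group so that its result can be concatenated with other pieces. Note that String.contains does a literal substring check, not a regex match, so the sketch uses findFirstIn for the per-word test.

```scala
import scala.util.matching.Regex

object EmoticonSketch {
  // Joins the alternatives into one non-capturing group,
  // so OR("a", "b") yields "(?:a|b)" and can be safely concatenated.
  def OR(patterns: String*): String = patterns.mkString("(?:", "|", ")")

  // Hypothetical, heavily simplified stand-ins for the Twokenize sub-patterns.
  val eyes  = "[:=]"
  val wink  = "[;]"
  val nose  = "-?"
  val mouth = "[)(DPp]"

  // Same shape as the "standard version" in the question: eyes, optional nose, mouth.
  val emoticon: Regex = (OR(eyes, wink) + nose + mouth).r

  // Removes every match, then collapses the whitespace left behind.
  def stripEmoticons(tweet: String): String =
    emoticon.replaceAllIn(tweet, "").replaceAll("\\s+", " ").trim

  // Regex-based test for a single word (String.contains would only match literally).
  def hasEmoticon(word: String): Boolean =
    emoticon.findFirstIn(word).isDefined
}
```

With these simplified patterns, EmoticonSketch.stripEmoticons("great day :) see you :-D") gives "great day see you" without splitting the tweet into words first. The real pattern adds lookarounds to avoid false positives that this toy version would still match, for example the ":D" inside "3:Dogs".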