Is there any library available that can tokenize source code written in different programming languages (Java/C/C++)? (Can it also identify parts of the code, such as the start and end of a function, or which tokens are identifiers?) I do not want to parse the source code; that would be overly complex. Moreover, the source code may not be error-free. Thanks in advance.
Tokenizing even error-free code samples is non-trivial, and certainly nothing exists that works for "any language". You'll probably have to be much more specific about what you're trying to solve -- otherwise, I suggest getting really cozy with flex and bison, or ANTLR. – sarnold, Apr 26, 2012
2 Answers
You can tokenize source code using a lexical analyzer (or lexer, for short) such as flex (for C) or JLex (for Java). The easiest way to get grammars that tokenize Java, C, and C++ may be to reuse (subject to licensing terms) the lexer specifications from an open source compiler with your favorite lexer generator. Even if you find the licensing conditions too onerous, they should be educational to look through...
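To make concrete what a lexer's output looks like, here is a minimal sketch in plain Java using `java.util.regex` (no generator involved). The token classes and their names (`ID`, `NUM`, `PUNCT`) are simplified illustrations, not what flex or JLex would actually produce; a real lexer also handles keywords, comments, string literals, and error recovery.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative hand-rolled lexer for a C-like language: it classifies
// identifiers, integer literals, and single-character punctuation,
// and silently skips whitespace. Hypothetical token categories.
public class TinyLexer {
    private static final Pattern TOKEN = Pattern.compile(
        "\\s+"                               // whitespace (skipped)
        + "|(?<ID>[A-Za-z_][A-Za-z0-9_]*)"   // identifiers (and keywords)
        + "|(?<NUM>\\d+)"                    // integer literals
        + "|(?<PUNCT>[{}();,*=+-])"          // single-char punctuation
    );

    public static List<String> tokenize(String src) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(src);
        int pos = 0;
        while (pos < src.length() && m.find(pos)) {
            if (m.start() != pos) break;     // unrecognized character: stop
            if (m.group("ID") != null)         tokens.add("ID:" + m.group("ID"));
            else if (m.group("NUM") != null)   tokens.add("NUM:" + m.group("NUM"));
            else if (m.group("PUNCT") != null) tokens.add("PUNCT:" + m.group("PUNCT"));
            pos = m.end();                   // advance past the match
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("int x = 42;"));
        // [ID:int, ID:x, PUNCT:=, NUM:42, PUNCT:;]
    }
}
```

Note that this classifies `int` as just another identifier: distinguishing keywords, or knowing that an identifier names a function, is exactly the kind of information tokenization alone does not give you.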
However, you still won't be able to identify the beginning and end of a function without parsing.
1 Comment
Not in all cases. Consider, for example, how parsing C or C++ code changes in the presence of typedef: a token that is initially an identifier must subsequently be recognized as a type name; if you don't do this, you will not be able to properly recognize declarations (including function declarations) that use the typedef. Some languages allow you to define arbitrary operators (new tokens). Some are simply pathological (try designing a Perl parser, or one for Haskell '98 with the broken brace-insertion rule).