A proper solution would probably be some learned/statistical model, but here are some fun ideas:
- Semicolons at the end of a line. This alone would catch a whole bunch of languages.
- Parentheses directly following text with no space to separate it:
    myFunc()
- A dot or arrow between two words:
    foo.bar = ptr->val
- Presence of curly braces, brackets:
    while (true) { bar[i]; }
- Presence of "comment" syntax (/*, //, etc):
    /* multi-line comment */
- Uncommon characters/operators:
    +, *, &, &&, |, ||, <, >, ==, !=, >=, <=, >>, <<, ::, __
- Run your syntax highlighter on the text. If it ends up highlighting some high percentage of it, it's probably code.
- camelCase text in the post.
- nested parentheses, braces, and/or brackets.
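
Most of the surface heuristics above can be approximated with a handful of regular expressions. The following is only a rough sketch in Python: the feature names, the `FEATURE_PATTERNS` table, and the helper functions are all made up for illustration, the patterns are untuned and will fire on ordinary prose too (URLs contain `//`, "e.g." looks like a dot between words), and the syntax-highlighter idea is left out entirely.

```python
import re

# Hypothetical feature names mapped to rough regexes. These are
# approximations of the heuristics in the list, not a tested rule set.
FEATURE_PATTERNS = {
    "semicolon_eol": re.compile(r";[ \t]*$", re.MULTILINE),   # ; at end of a line
    "call_parens": re.compile(r"\w\("),                       # myFunc(
    "dot_or_arrow": re.compile(r"\w(?:\.|->)\w"),             # foo.bar, ptr->val
    "braces_brackets": re.compile(r"[{}\[\]]"),               # { } [ ]
    "comment_syntax": re.compile(r"/\*|\*/|//"),              # /* */ //
    "operators": re.compile(r"&&|\|\||==|!=|>=|<=|>>|<<|::|__"),
    "camel_case": re.compile(r"\b[a-z]+[A-Z][A-Za-z]*\b"),    # camelCase
}


def count_features(text):
    """Return a dict mapping each feature name to its match count in text."""
    return {name: len(p.findall(text)) for name, p in FEATURE_PATTERNS.items()}


def max_nesting_depth(text):
    """Deepest nesting of (), [] and {} seen; closers are not matched by type."""
    depth = best = 0
    for ch in text:
        if ch in "([{":
            depth += 1
            best = max(best, depth)
        elif ch in ")]}" and depth > 0:
            depth -= 1
    return best


snippet = "foo.bar = ptr->val;\nwhile (true) { bar[i]; }"
features = count_features(snippet)
features["max_nesting"] = max_nesting_depth(snippet)
```

The counts are deliberately left raw; any thresholds or weights would have to come from real posts rather than guesswork.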
One could count how many times each of these appears and use the counts as features in a machine-learning algorithm such as a perceptron, the way SpamAssassin does.
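
To make the perceptron idea concrete, here is a minimal sketch of the classic perceptron update rule over feature-count vectors, in plain Python. It is not SpamAssassin's code; the function names, the learning rate, the epoch count, and the four training vectors are all invented for illustration (each toy vector is imagined as `[semicolons, braces, camelCase words]`). A real classifier would need many labeled posts.

```python
def train_perceptron(samples, labels, epochs=20, lr=1.0):
    """Train a perceptron.

    samples: list of equal-length numeric feature vectors.
    labels:  +1 for "code", -1 for "not code".
    Returns (weights, bias).
    """
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            activation = sum(w * xi for w, xi in zip(weights, x)) + bias
            predicted = 1 if activation > 0 else -1
            if predicted != y:
                # Misclassified: nudge the weights toward the correct label.
                weights = [w + lr * y * xi for w, xi in zip(weights, x)]
                bias += lr * y
    return weights, bias


def predict(weights, bias, x):
    """Return +1 (code) or -1 (not code) for feature vector x."""
    activation = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if activation > 0 else -1


# Made-up toy data, only to show the call shape.
toy_samples = [[3, 4, 2], [1, 2, 0], [0, 0, 0], [0, 0, 1]]
toy_labels = [1, 1, -1, -1]
weights, bias = train_perceptron(toy_samples, toy_labels)
```

A perceptron only converges when the two classes are linearly separable in the chosen features, so on real posts it would be sensible to cap the number of epochs (as done here) and check accuracy on held-out examples.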