I would be curious to see the average metrics of written English on one side, and of code on the other:
- length of paragraphs
- length of lines
- size of words
- chars used
- ratio between alphabetic, numeric and other symbol characters
- number of symbols per word
- etc.
Maybe that alone would already be enough to tell code apart from the rest. At least I believe code, regardless of language, would show noticeably different metrics in many cases.
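To make a couple of those metrics concrete, here is a minimal Python sketch of the character-class ratio and the symbols-per-word count. The function name `char_metrics`, the dictionary keys, and the two sample strings are made up for illustration, and "symbol" is assumed to mean any visible character that is neither a letter nor a digit.

```python
def char_metrics(text: str) -> dict:
    """Character-class ratios and symbols per word for a block of text."""
    # Ignore whitespace so the ratios describe visible characters only.
    visible = [c for c in text if not c.isspace()]
    words = text.split()  # whitespace-separated tokens

    alpha = sum(c.isalpha() for c in visible)
    digit = sum(c.isdigit() for c in visible)
    other = len(visible) - alpha - digit  # assumption: everything else is a "symbol"

    total = len(visible) or 1  # avoid dividing by zero on empty input
    return {
        "alpha_ratio": alpha / total,
        "digit_ratio": digit / total,
        "symbol_ratio": other / total,
        "symbols_per_word": other / (len(words) or 1),
    }


# Hypothetical samples, not taken from either post.
english = "Maybe that alone could discriminate between code and the rest."
code = "for (i = 0; i < n; i++) { total += values[i]; }"

print(char_metrics(english)["symbol_ratio"])
print(char_metrics(code)["symbol_ratio"])
```

On samples like these the code line has a much higher share of symbols than the English sentence, which is the kind of gap a classifier could pick up on.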
The good news is: you already have plenty of data to build your statistics upon.
OK, I'm back with some data to back up my assumptions. :-)
I did a quick and dirty test on your own post and on the first post I found on StackOverflow, with a pretty advanced tool: wc.
Here is what I got after running wc on the text part and on the code part of those two examples:
First, let's look at the English part:
- The English part of your post (2635 chars, 468 words, 32 lines)
  - 5 chars/word, 82 chars/line, 14 words/line
- The English part of the other post (1499 chars, 237 words, 12 lines)
  - 6 chars/word, 124 chars/line, 19 words/line
Pretty similar, don't you think?
Now let's take a look at the code part!
- The code part of your post (174 chars, 13 words, 3 lines)
  - 13 chars/word, 58 chars/line, 4 words/line
- The code part of the other post (4181 chars, 287 words, 151 lines)
  - 14 chars/word, 27 chars/line, 2 words/line
See how close those code metrics are to each other, but more importantly, how different they are from the English metrics? And this is just using a limited tool. I am now sure you can get something really accurate by measuring more metrics (I'm thinking in particular of character statistics).
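If you want to repeat the experiment without the shell, the per-word and per-line figures are just the three wc counts divided by each other. Here is a rough Python sketch, assuming counting that roughly mirrors `wc -m`, `wc -w` and `wc -l`; the function name `wc_ratios` and the sample snippet are hypothetical, not the data from either post.

```python
def wc_ratios(text: str) -> dict:
    """wc-style counts plus the derived per-word and per-line ratios."""
    chars = len(text)          # characters including whitespace, roughly `wc -m`
    words = len(text.split())  # whitespace-separated tokens, roughly `wc -w`
    lines = text.count("\n")   # number of newline characters, like `wc -l`
    return {
        "chars": chars,
        "words": words,
        "lines": lines,
        # max(..., 1) guards against dividing by zero on empty input
        "chars_per_word": chars / max(words, 1),
        "chars_per_line": chars / max(lines, 1),
        "words_per_line": words / max(lines, 1),
    }


# Hypothetical code sample.
sample = "int main(void) {\n    return 0;\n}\n"
print(wc_ratios(sample))
```

Feeding it the text part and the code part of a post separately gives the same kind of numbers as above, as unrounded ratios.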
I can haz cookie?