Revisions to Simple method for reliably detecting code in text?

replaced http://stackoverflow.com/ with https://stackoverflow.com/

edited May 23, 2017 at 12:40

1

I would be curious to see the average metrics of written English on one side, and of code on the other side:

length of paragraphs
length of lines
size of words
chars used
ratio between alphabetic, numeric and other symbol characters
number of symbols per word
etc.

Maybe that alone could already discriminate between code and the rest. At least I believe code, regardless of language, would show some noticeably different metrics in many cases.
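
To make a couple of the metrics listed above concrete, here is a minimal Python sketch of the character-class ratio and the symbols-per-word count. The function names `char_class_ratios` and `symbols_per_word` are made up for this illustration, and treating every non-letter, non-digit, non-whitespace character as a "symbol" is an assumption, not something the question specifies.

```python
def char_class_ratios(text: str) -> dict:
    """Share of alphabetic, numeric and other symbol characters.

    Whitespace is ignored; anything that is neither a letter nor a digit
    is counted as a "symbol" (an assumption made for this sketch).
    """
    visible = [c for c in text if not c.isspace()]
    total = len(visible)
    if total == 0:
        return {"alpha": 0.0, "digit": 0.0, "symbol": 0.0}
    alpha = sum(1 for c in visible if c.isalpha())
    digit = sum(1 for c in visible if c.isdigit())
    symbol = total - alpha - digit
    return {
        "alpha": alpha / total,
        "digit": digit / total,
        "symbol": symbol / total,
    }


def symbols_per_word(text: str) -> float:
    """Average number of symbol characters per whitespace-separated word."""
    words = text.split()
    if not words:
        return 0.0
    symbols = sum(
        1 for w in words for c in w if not (c.isalpha() or c.isdigit())
    )
    return symbols / len(words)


if __name__ == "__main__":
    english = "The quick brown fox jumps over the lazy dog."
    code = "for (i = 0; i < n; ++i) { a[i] += b[i]; }"
    for label, sample in (("english", english), ("code", code)):
        print(label, char_class_ratios(sample), symbols_per_word(sample))
```

On short samples like these the symbol share of the code line comes out far higher than that of the English sentence, which is the kind of gap the list above is hoping for. Real posts would of course need many more samples before drawing any threshold.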

The good news is: you already have plenty of data to build your statistics upon.

Ok I'm back with some data to back my assumptions up. :-)

I did a quick and dirty test on your own post and on the first post I found on StackOverflow, with a pretty advanced tool: wc.

Here is what I had after running wc on the text part and on the code part of those two examples:

First, let's look at the English part:

The English part of your post (2635 chars, 468 words, 32 lines)
- 5 chars/word, 82 chars/line, 14 words/line
The English part of the other post (1499 chars, 237 words, 12 lines)
- 6 chars/word, 124 chars/line, 19 words/line

Pretty similar, don't you think?

Now let's take a look at the code part!

The code part of your post (174 chars, 13 words, 3 lines)
- 13 chars/word, 58 chars/line, 4 words/line
The code part of the other post (4181 chars, 287 words, 151 lines)
- 14 chars/word, 27 chars/line, 2 words/line

See how close those metrics are to each other but, more importantly, how different they are from the English metrics? And this is just using a limited tool. I am now sure you can get something really accurate by measuring more metrics (I'm thinking in particular of character statistics).
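
For anyone who wants to reproduce this kind of measurement without the shell, here is a rough Python sketch of the same wc-style counts and the three ratios used above. The helper names `wc_counts` and `ratios` are hypothetical, and the counting rules (characters as in `wc -m`, words split on whitespace, lines as the number of newline characters) are my approximation of what wc does, not an exact reimplementation.

```python
def wc_counts(text: str) -> tuple:
    """Return (chars, words, lines), approximating wc.

    chars: number of characters, words: whitespace-separated tokens,
    lines: number of newline characters.
    """
    chars = len(text)
    words = len(text.split())
    lines = text.count("\n")
    return chars, words, lines


def ratios(chars: int, words: int, lines: int) -> dict:
    """Compute chars/word, chars/line and words/line, guarding against zero."""
    return {
        "chars_per_word": chars / words if words else 0.0,
        "chars_per_line": chars / lines if lines else 0.0,
        "words_per_line": words / lines if lines else 0.0,
    }


if __name__ == "__main__":
    # Counts quoted above for the English and code parts of the two posts.
    samples = {
        "english, your post": (2635, 468, 32),
        "english, other post": (1499, 237, 12),
        "code, your post": (174, 13, 3),
        "code, other post": (4181, 287, 151),
    }
    for label, counts in samples.items():
        print(label, ratios(*counts))
```

To run it on actual text rather than on the quoted counts, something like `ratios(*wc_counts(open(path).read()))` should do, where `path` is whatever file holds the extracted English or code part.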

I can haz cookie?

Post Made Community Wiki by Omar Kooheji

occurred Jun 28, 2011 at 9:43

TESTED!

edited Jun 28, 2011 at 9:24

Julien Guertault

I would be curious to see the average metrics of written English on one side, and of code on the other side:

length of paragraphs
length of lines
size of words
chars used
ratio between alphabetic, numeric and other symbol characters
number of symbols per word
etc.

Maybe that alone could already discriminate between code and the rest. At least I believe code, regardless of language, would show some noticeably different metrics in many cases.

The good news is: you already have plenty of data to build your statistics upon.

Ok I'm back with some data to back my assumptions up. :-)

I did a quick and dirty test on your own post and on the first post I found on StackOverflow, with a pretty advanced tool: wc.

Here is what I had after running wc on the text part and on the code part of those two examples:

First, let's look at the English part:

The English part of your post (2635 chars, 468 words, 32 lines)
- 5 chars/word, 82 chars/line, 14 words/line

The English part of the other post (1499 chars, 237 words, 12 lines)
- 6 chars/word, 124 chars/line, 19 words/line

Pretty similar, don't you think?

Now let's take a look at the code part!

The code part of your post (174 chars, 13 words, 3 lines)
- 13 chars/word, 58 chars/line, 4 words/line

The code part of the other post (4181 chars, 287 words, 151 lines)
- 14 chars/word, 27 chars/line, 2 words/line

See how close those metrics are to each other but, more importantly, how different they are from the English metrics? And this is just using a limited tool. I am now sure you can get something really accurate by measuring more metrics (I'm thinking in particular of character statistics).

I can haz cookie?

More hints

edited Jun 28, 2011 at 8:34

Julien Guertault

I would be curious to see the average metrics of written English on one side, and of code on the other side:

length of paragraphs
length of lines
size of words
chars used
ratio between alphabetic, numeric and other symbol characters
number of symbols per word
etc.

Maybe that alone could already discriminate between code and the rest. At least I believe code, regardless of language, would show some noticeably different metrics in many cases.

The good news is: you already have plenty of data to build your statistics upon.

answered Jun 28, 2011 at 8:22

Julien Guertault
