Revisions to Simple method for reliably detecting code in text?

added 308 characters in body

edited Jun 28, 2011 at 16:20

A proper solution would probably be some learned/statistical model, but here are some fun ideas:

Semicolons at the end of a line. This alone would catch a whole bunch of languages.
Parentheses directly following text with no space to separate it: myFunc()
A dot or arrow between two words: foo.bar = ptr->val
Presence of curly braces or square brackets: while (true) { bar[i]; }
Presence of "comment" syntax (/*, //, etc.): /* multi-line comment */
Uncommon characters/operators: +, *, &, &&, |, ||, <, >, ==, !=, >=, <=, >>, <<, ::, __
Run your syntax highlighter on the text. If it ends up highlighting some high percentage of it, it's probably code.
camelCase text in the post.
Nested parentheses, braces, and/or brackets.
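
As a rough illustration of the ideas above, here is a minimal Python sketch that counts how often each heuristic fires in a piece of text. The feature names, the regular expressions, and the helper names (`FEATURE_PATTERNS`, `max_nesting_depth`, `extract_features`) are my own assumptions for the example, not a tested detector. Expect false positives: the dot rule, for instance, also fires on "e.g." in ordinary prose.

```python
import re

# Hypothetical feature names; each regex only approximates one heuristic from the list.
FEATURE_PATTERNS = {
    "semicolon_eol": re.compile(r";[ \t]*$", re.MULTILINE),  # semicolon at end of a line
    "call_parens": re.compile(r"\b\w+\("),                   # text directly followed by "("
    "member_access": re.compile(r"\w(?:\.|->)\w"),           # dot or arrow between two words
    "braces_brackets": re.compile(r"[{}\[\]]"),              # curly braces, square brackets
    "comment_syntax": re.compile(r"/\*|\*/|//"),             # C-style comment markers
    "operators": re.compile(r"&&|\|\||==|!=|>=|<=|>>|<<|::|__"),
    "camel_case": re.compile(r"\b[a-z]+(?:[A-Z][a-z0-9]*)+\b"),
}


def max_nesting_depth(text):
    """Return the deepest nesting of (), [] and {} found in the text."""
    closers = {")": "(", "]": "[", "}": "{"}
    stack = []
    deepest = 0
    for ch in text:
        if ch in "([{":
            stack.append(ch)
            deepest = max(deepest, len(stack))
        elif ch in closers and stack and stack[-1] == closers[ch]:
            stack.pop()
    return deepest


def extract_features(text):
    """Count how many times each heuristic matches, plus the nesting depth."""
    counts = {name: len(p.findall(text)) for name, p in FEATURE_PATTERNS.items()}
    counts["nesting_depth"] = max_nesting_depth(text)
    return counts


if __name__ == "__main__":
    sample = "while (true) { bar[i]; }\nfoo.bar = ptr->val;\n"
    print(extract_features(sample))
```

A simple threshold on the sum of these counts might already be usable; the counts can also serve as inputs to a learned model.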

One could count how many times each of these appears and use those counts as features in a machine-learning algorithm such as a perceptron, the way SpamAssassin does.
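
To make the perceptron idea concrete, here is a minimal sketch of the classic perceptron update rule applied to feature-count vectors. Everything in it is an assumption for illustration: the three features (semicolons, braces/brackets, member accesses), the tiny hand-made training set, and the function names. It is not how SpamAssassin itself is implemented, and a real classifier would need far more labelled data.

```python
def train_perceptron(samples, labels, epochs=20, lr=1.0):
    """Learn weights and a bias with the classic perceptron update rule."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            activation = bias + sum(w * xi for w, xi in zip(weights, x))
            predicted = 1 if activation > 0 else 0
            error = y - predicted  # 0 when correct, +1 or -1 when wrong
            if error:
                weights = [w + lr * error * xi for w, xi in zip(weights, x)]
                bias += lr * error
    return weights, bias


def predict(weights, bias, x):
    """Return 1 for 'looks like code', 0 for 'looks like prose'."""
    return 1 if bias + sum(w * xi for w, xi in zip(weights, x)) > 0 else 0


# Hypothetical toy data: [semicolons, braces/brackets, member accesses] per post.
TOY_SAMPLES = [
    [3, 4, 2], [1, 2, 1], [2, 2, 3],  # posts labelled as code
    [0, 0, 0], [0, 0, 1], [1, 0, 0],  # posts labelled as prose
]
TOY_LABELS = [1, 1, 1, 0, 0, 0]

if __name__ == "__main__":
    weights, bias = train_perceptron(TOY_SAMPLES, TOY_LABELS)
    print(predict(weights, bias, [2, 3, 1]))
```

The perceptron only converges when the two classes are linearly separable in the chosen features, which is one more reason to treat this as a starting point rather than a finished detector.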

Post Made Community Wiki by Omar Kooheji

occurred Jun 28, 2011 at 9:43

added 147 characters in body

edited Jun 28, 2011 at 8:42

Yevgeniy Brikman

A proper solution would probably be some learned/statistical model, but here are some fun ideas:

Semi-colons at the end of a line. This alone would catch a whole bunch of languages.
Parentheses directly following text with no space to separate it: myFunc()
A dot or arrow between two words: foo.bar = ptr->val
Presence of curly braces, brackets: while (true) { bar[i]; }
Presence of "comment" syntax (/*, //, etc): /* multi-line comment */
Uncommon characters/operators: +, *, &, &&, |, ||, <, >, ==, !=, >=, <=, >>, <<, ::, __

Run your syntax highlighter on the text. If it ends up highlighting some high percentage of it, it's probably code.

added 147 characters in body

edited Jun 28, 2011 at 8:37

Yevgeniy Brikman

A proper solution would probably be some learned/statistical model, but here are some fun ideas:

Semi-colons at the end of a line. This alone would catch a whole bunch of languages.
Parentheses directly following text: myFunc()
Run your syntax highlighter on the text. If it ends up highlighting some high percentage of it, it's probably code.
Presence of curly braces, brackets and equals: while (true) { foo = bar[i]; }

A proper solution would probably be some learned/statistical model, but here are some fun ideas:

Semi-colons at the end of a line. This alone would catch a whole bunch of languages.
Parentheses directly following text with no space to separate it: myFunc()
A dot or arrow between two words: foo.bar = ptr->val
Run your syntax highlighter on the text. If it ends up highlighting some high percentage of it, it's probably code.
Presence of curly braces, brackets: while (true) { bar[i]; }
Presence of "comment" syntax (/*, //, etc): /* multi-line comment */
Uncommon characters/operators: +, *, &, &&, |, ||, <, >, ==, !=, >=, <=, >>, <<, ::, __

answered Jun 28, 2011 at 8:31

Yevgeniy Brikman
