added 308 characters in body
Ken Bloom

A proper solution would probably be some learned/statistical model, but here are some fun ideas:

  1. Semicolons at the end of a line. This alone would catch a whole bunch of languages.
  2. Parentheses directly following text with no space to separate it: myFunc()
  3. A dot or arrow between two words: foo.bar = ptr->val
  4. Presence of curly braces or brackets: while (true) { bar[i]; }
  5. Presence of "comment" syntax (/*, //, etc.): /* multi-line comment */
  6. Characters/operators that are uncommon in ordinary prose: +, *, &, &&, |, ||, <, >, ==, !=, >=, <=, >>, <<, ::, __
  7. Run your syntax highlighter on the text. If it ends up highlighting some high percentage of it, it's probably code.
  8. camelCase text in the post.
  9. Nested parentheses, braces, and/or brackets.

One could count how many times each of these appears and use the counts as features for a machine-learning algorithm such as a perceptron, the way SpamAssassin does.
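
To make the counting idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the feature names, the regexes (rough approximations of ideas 1 to 6, 8 and 9; the syntax-highlighter idea is left out), and the tiny perceptron trainer. It is not how SpamAssassin is actually implemented, just one way the pieces could fit together.

```python
import re

# Hypothetical feature names; each regex roughly approximates one heuristic.
PATTERNS = {
    "semicolon_eol": re.compile(r";[ \t]*$", re.MULTILINE),    # idea 1
    "call_parens": re.compile(r"\w\("),                        # idea 2
    "dot_or_arrow": re.compile(r"\w(?:\.|->)\w"),              # idea 3
    "braces_brackets": re.compile(r"[{}\[\]]"),                # idea 4
    # idea 5; the lookbehind avoids counting the "//" in "http://"
    "comment_syntax": re.compile(r"/\*|\*/|(?<!:)//"),
    "operators": re.compile(r"&&|\|\||[=!<>]=|<<|>>|::|__"),   # part of idea 6
    "camel_case": re.compile(r"\b[a-z]+[A-Z][A-Za-z]*\b"),     # idea 8
}


def max_nesting_depth(text):
    """Deepest nesting of (), [] and {} (idea 9); bracket types are not matched up."""
    depth = deepest = 0
    for ch in text:
        if ch in "([{":
            depth += 1
            deepest = max(deepest, depth)
        elif ch in ")]}" and depth > 0:
            depth -= 1
    return deepest


def count_features(text):
    """Return a dict mapping feature name to its count in text."""
    counts = {name: len(p.findall(text)) for name, p in PATTERNS.items()}
    counts["nesting_depth"] = max_nesting_depth(text)
    return counts


def predict(weights, bias, counts):
    """Return 1 (looks like code) if the weighted sum is positive, else 0."""
    score = bias + sum(w * counts[name] for name, w in weights.items())
    return 1 if score > 0 else 0


def train(examples, epochs=10, lr=1.0):
    """Classic perceptron rule: adjust weights only on misclassified examples."""
    weights = {name: 0.0 for name in count_features("")}
    bias = 0.0
    for _ in range(epochs):
        for text, label in examples:  # label: 1 = code, 0 = prose
            counts = count_features(text)
            error = label - predict(weights, bias, counts)
            if error:
                for name in weights:
                    weights[name] += lr * error * counts[name]
                bias += lr * error
    return weights, bias
```

You would call `train` with a list of `(text, label)` pairs taken from posts you have labelled by hand, then run `predict(weights, bias, count_features(post))` on new posts. With real data the raw counts would probably need normalising by post length, and prose that uses a lot of parentheses or URLs will still produce false positives.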

Post Made Community Wiki by Omar Kooheji
Yevgeniy Brikman