How to automatically detect code snippets in a text sample?

Question

I'm analyzing a large set of GitHub comments, and for that I need to automatically exclude the code samples and error messages they contain.

Put another way, I want to keep only the English part of the comments. There are a few libraries that detect the language of a sentence, but my case has two challenges: 1) the comment text does not always follow proper English grammar, and 2) the code samples and error messages consist mainly of English words too.

So what would be the best approach? The results don't need to be 100% accurate; I just want an approach that gives satisfactory results. Any ideas?

"I need to exclude the code samples and error messages from the comments automatically from a large set." Just to be sure we're clear, you want to extract all the English in the comments, except you do not want to extract code samples and error messages, regardless of the language they're in. Is that correct?
– Millie Smith, Commented Nov 5, 2017 at 3:33
@MillieSmith, yes, that would be my primary goal. However, for simplicity, I can give up on acquiring text in other languages, so extracting only the English comments would do as well.
– Nasif Imtiaz Ohi, Commented Nov 5, 2017 at 22:12

dTanMan · Accepted Answer · 2020-10-02 07:14:53Z


This question is old, but a Google search led me here, so I'm offering this answer in case anyone else stumbles upon it too.
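For readers looking for a starting point, here is a minimal sketch of one possible heuristic. It is an illustration only and not necessarily the approach this answer originally described. It assumes the comments are GitHub-flavored Markdown, so fenced blocks, indented blocks, and inline code spans can be dropped first; the remaining lines are then filtered by a few regex hints and by symbol density. The names `strip_code` and `looks_like_code`, the hint patterns, and the 0.12 threshold are all hypothetical and would need tuning on real data.

```python
import re

FENCE = "`" * 3  # Markdown code-fence marker
INLINE_CODE = re.compile(r"`[^`\n]+`")  # inline code spans

# Hypothetical hints that a line is code or an error message; tune on real data.
CODE_HINTS = re.compile(
    r"^Traceback \(most recent call last\)"   # Python traceback header
    r'|^File ".+", line \d+'                  # Python traceback frame
    r"|^at [\w.$<>]+\(.*\)$"                  # Java/JavaScript stack frame
    r"|^[\w.]*(?:Error|Exception):"           # e.g. "ValueError: ..."
    r"|[;{}]$"                                # typical statement endings
)
# Characters that are common in code but rare in ordinary prose.
SYMBOLS = set("{}[]()<>=;_/\\|&*#$@")


def looks_like_code(line: str, max_symbol_ratio: float = 0.12) -> bool:
    """Guess whether a single line is code or an error message."""
    stripped = line.strip()
    if not stripped:
        return False
    if CODE_HINTS.search(stripped):
        return True
    symbols = sum(ch in SYMBOLS for ch in stripped)
    return symbols / len(stripped) > max_symbol_ratio


def strip_code(comment: str) -> str:
    """Return the comment with likely code and error lines removed."""
    kept = []
    in_fence = False
    for line in comment.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # entering or leaving a fenced block
            continue
        # Assumption: lines indented by 4 spaces or a tab are code blocks.
        if in_fence or line.startswith(("    ", "\t")):
            continue
        line = INLINE_CODE.sub(" ", line)  # drop inline code spans
        if looks_like_code(line):
            continue
        kept.append(line)
    return "\n".join(kept).strip()


sample = "\n".join([
    "Thanks for the report, I can reproduce this on master.",
    FENCE + "python",
    "foo.run(x=1)",
    FENCE,
    "ValueError: invalid literal for int()",
    "Could you try the `--force` flag and tell me what happens?",
])
print(strip_code(sample))
```

Since the results don't need to be 100% accurate, a rule-based filter like this may be enough; if it is not, the same per-line features (symbol ratio, hint matches) could feed a small classifier trained on hand-labeled comments.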
