1

I'm doing some analysis on GitHub comments. For that, I need to automatically exclude code samples and error messages from a large set of comments.

Put another way, I want to keep only the English prose in each comment. There are a few libraries that detect the language of a sentence, but my case poses a couple of challenges: 1) the prose does not always follow proper English grammar, and 2) the code samples and error messages consist mostly of English words too.

So what would be the best approach? The results don't need to be 100% accurate; I just want an approach that gives a satisfactory result. Any ideas?

2
  • "I need to exclude the code samples and error messages from the comments automatically from a large set." Just to be sure we're clear: you want to extract all the English in the comments, except that you do not want to extract code samples and error messages, regardless of the language they're in. Is that correct? Commented Nov 5, 2017 at 3:33
  • @MillieSmith, yes, that would be my primary goal. However, for simplicity, I can give up on extracting text in other languages. So extracting only the "English" comments would do as well. Commented Nov 5, 2017 at 22:12

1 Answer

2

This question is old, but a Google search led me here, so I'm offering this answer in case anyone else stumbles onto it too.
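One heuristic that might give a satisfactory (not 100% accurate) result is sketched below. It assumes the comments are GitHub-flavored Markdown, so much of the code sits in fenced blocks or inline backticks and can be stripped with regular expressions; whatever remains is filtered line by line using a symbol-density threshold and a few stack-trace patterns. The names `extract_prose` and `looks_like_code`, the patterns, and the `0.15` threshold are all illustrative assumptions to tune on your own data, not an established method.

```python
import re

# Fenced blocks (``` or ~~~) and inline `code` spans in GitHub-flavored Markdown.
FENCED = re.compile(r"^(```|~~~).*?^\1[^\n]*$", re.DOTALL | re.MULTILINE)
INLINE = re.compile(r"`[^`\n]+`")

# Assumed markers of pasted error output (Python and Java style traces).
ERROR_LINE = re.compile(
    r"^\s*(Traceback \(most recent call last\)"
    r"|File \".*\", line \d+"
    r"|at \S+\(.*\)"
    r"|\w*(Error|Exception):)"
)

# Characters that are common in code but rare in English prose.
CODE_CHARS = set("{}()[];=<>|&$#\\/_")


def looks_like_code(line, max_symbol_ratio=0.15):
    """Guess whether a single line is code or error output rather than prose."""
    stripped = line.strip()
    if not stripped:
        return False
    # Markdown treats lines indented by four spaces or a tab as code.
    if line.startswith("    ") or line.startswith("\t"):
        return True
    if ERROR_LINE.match(line):
        return True
    symbols = sum(ch in CODE_CHARS for ch in stripped)
    return symbols / len(stripped) > max_symbol_ratio


def extract_prose(comment):
    """Return the comment with fenced code, inline code, and code-like lines removed."""
    text = FENCED.sub("", comment)
    text = INLINE.sub("", text)
    kept = [ln for ln in text.splitlines() if not looks_like_code(ln)]
    # Collapse the runs of blank lines left behind by removed blocks.
    return re.sub(r"\n{3,}", "\n\n", "\n".join(kept)).strip()
```

Calling `extract_prose(comment)` on each comment leaves only the lines that pass the filters. Code pasted without any Markdown formatting, and with few symbols, will still slip through, so it is worth checking a sample of the output by hand before relying on it.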
