How can I detect the programming language of a snippet?

Question

I have a string containing some text. The text may or may not be code. Using GitHub's Linguist, I have been able to detect the likely programming language, but only if I give it a list of candidates.

# test_linguist_1.rb
#!/usr/bin/env ruby

require 'linguist'

s = "int main(){}"
candidates = [Linguist::Language["Python"], Linguist::Language["C"], Linguist::Language["Ruby"]]
b = Linguist::Blob.new('', s)
langs = Linguist::Classifier.call(b, candidates)
puts langs.inspect

Execution:

$ ./test_linguist_1.rb 
[#<Linguist::Language name=C>, #<Linguist::Language name=Python>, #<Linguist::Language name=Ruby>]

Notice that I gave it a list of candidates. How can I avoid having to define a list of candidates?

I tried the following:

# test_linguist_2.rb
#!/usr/bin/env ruby

require 'linguist'

s = "int main(){}"
candidates = Linguist::Language.all
# I also tried only Popular
# candidates = Linguist::Language.popular
b = Linguist::Blob.new('', s)
langs = Linguist::Classifier.call(b, candidates)
puts langs.inspect

Execution:

$ ./test_linguist_2.rb 
/home/marvelez/.rvm/gems/ruby-2.2.1/gems/github-linguist-4.8.9/lib/linguist/classifier.rb:131:in `token_probability': undefined method `[]' for nil:NilClass (NoMethodError)
from /home/marvelez/.rvm/gems/ruby-2.2.1/gems/github-linguist-4.8.9/lib/linguist/classifier.rb:120:in `block in tokens_probability'
from /home/marvelez/.rvm/gems/ruby-2.2.1/gems/github-linguist-4.8.9/lib/linguist/classifier.rb:119:in `each'
from /home/marvelez/.rvm/gems/ruby-2.2.1/gems/github-linguist-4.8.9/lib/linguist/classifier.rb:119:in `inject'
from /home/marvelez/.rvm/gems/ruby-2.2.1/gems/github-linguist-4.8.9/lib/linguist/classifier.rb:119:in `tokens_probability'
from /home/marvelez/.rvm/gems/ruby-2.2.1/gems/github-linguist-4.8.9/lib/linguist/classifier.rb:105:in `block in classify'
from /home/marvelez/.rvm/gems/ruby-2.2.1/gems/github-linguist-4.8.9/lib/linguist/classifier.rb:104:in `each'
from /home/marvelez/.rvm/gems/ruby-2.2.1/gems/github-linguist-4.8.9/lib/linguist/classifier.rb:104:in `classify'
from /home/marvelez/.rvm/gems/ruby-2.2.1/gems/github-linguist-4.8.9/lib/linguist/classifier.rb:78:in `classify'
from /home/marvelez/.rvm/gems/ruby-2.2.1/gems/github-linguist-4.8.9/lib/linguist/classifier.rb:20:in `call'
from ./test_linguist.rb:21:in `block in <main>'
from ./test_linguist.rb:14:in `each'
from ./test_linguist.rb:14:in `<main>'

Additional:

Is this the best way to use GitHub Linguist? FileBlob is an alternative to Blob, but it requires writing my string to a file. This is problematic for two reasons: 1) it is slow, and 2) the chosen file extension then guides Linguist, and we do not know the correct file extension.
Are there better tools for this? GitHub Linguist perhaps works well on files but not on strings.

mwp · Accepted Answer · 2016-09-09 03:01:44Z

5

From a quick look at Linguist's source code, it appears to use a number of strategies to determine the language, calling each one in turn. Classifier is the last strategy to be called, by which time it has (hopefully) picked up language "candidates" (as you've discovered for yourself) from the prior strategies. So for the particular sample you've shared with us, I think you have to pass either a filename of some kind, even if the file doesn't actually exist, or a list of language candidates. If neither is an option for you, this may not be a feasible solution for your problem.

$ ruby -r linguist -e 'p Linguist::Blob.new("foo.c", "int main(){}").language'
#<Linguist::Language name=C>

It returns nil without a filename, and #<Linguist::Language name=C++> with "foo.cc" and the same code sample.
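The chain described above can be sketched in plain Ruby. This is a simplified illustration, not Linguist's actual code: the Blob struct, the EXTENSIONS table, and both strategy lambdas are invented for the example. It only shows why a classifier that refines candidates has nothing to work with when no earlier strategy has produced any.

```ruby
# Toy strategy chain; NOT Linguist's implementation.
Blob = Struct.new(:name, :data)

# Hypothetical extension table, for illustration only.
EXTENSIONS = {
  ".c"  => ["C"],
  ".cc" => ["C++"],
  ".h"  => ["C", "C++", "Objective-C"]
}.freeze

# Strategy 1: propose candidates from the file extension.
by_extension = ->(blob, _candidates) { EXTENSIONS.fetch(File.extname(blob.name), []) }

# Strategy 2: a stand-in "classifier" that can only order the candidates
# it receives; given no candidates, it returns none.
by_classifier = ->(_blob, candidates) { candidates.sort }

STRATEGIES = [by_extension, by_classifier].freeze

def detect(blob)
  candidates = []
  STRATEGIES.each do |strategy|
    result = strategy.call(blob, candidates)
    next if result.empty?          # nothing found, try the next strategy
    candidates = result
    break if candidates.size == 1  # unambiguous, stop early
  end
  candidates.first
end

p detect(Blob.new("foo.c", "int main(){}"))   # => "C"
p detect(Blob.new("foo.cc", "int main(){}"))  # => "C++"
p detect(Blob.new("", "int main(){}"))        # => nil
```

With an empty name, the extension strategy yields nothing, so the refining step receives an empty list and the result is nil, which mirrors the behaviour shown above.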

The good news is that you picked a really bad sample to test with. :-) Other strategies look at modelines and shebangs, so more complex samples have a better chance at succeeding. Take a look at these:

$ ruby -r linguist -e 'p Linguist::Blob.new("", "#!/usr/bin/env perl
print q{Hello, world!};
").language'
#<Linguist::Language name=Perl>
$ ruby -r linguist -e 'p Linguist::Blob.new("", "# vim: ft=ruby
puts %q{Hello, world!}
").language'
#<Linguist::Language name=Ruby>
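To make the idea concrete, here is a rough sketch of how a shebang or a Vim modeline can be pulled out of a string. The regular expressions and method names are my own simplifications and assumptions, not Linguist's real rules, which cover many more cases; the result is a raw interpreter or filetype name, not a Linguist::Language.

```ruby
# Simplified patterns for illustration; NOT Linguist's actual rules.
SHEBANG  = /\A#!\s*(?:\S*\/)?(?:env\s+)?([\w.+-]+)/
MODELINE = /\b(?:vim?|ex):\s*(?:set\s+)?.*?\b(?:ft|filetype)=(\w+)/

# Returns the interpreter named on the first line, or nil.
def interpreter_from_shebang(text)
  text.lines.first.to_s[SHEBANG, 1]
end

# Returns the filetype from a modeline near the start or end, or nil.
# Vim itself only checks a few lines at the top and bottom of a file.
def filetype_from_modeline(text)
  lines = text.lines
  (lines.first(5) + lines.last(5)).each do |line|
    match = MODELINE.match(line)
    return match[1] if match
  end
  nil
end

def hint_for(text)
  interpreter_from_shebang(text) || filetype_from_modeline(text)
end

p hint_for("#!/usr/bin/env perl\nprint q{Hello, world!};\n")  # => "perl"
p hint_for("# vim: ft=ruby\nputs %q{Hello, world!}\n")        # => "ruby"
p hint_for("int main(){}")                                    # => nil
```

The last call shows why the original sample is hard: it carries neither kind of hint.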

However, if there isn't a shebang or a modeline, we're still out of luck. It turns out that there's a training dataset that is computed and serialized to disk at install time, and automatically loaded during language detection. Unfortunately, I think there's a bug in the library that is preventing this training dataset from being used if there aren't any candidates by the time it gets to this step. Fixing the bug lets me do this:

$ ruby -Ilib -r linguist -e 'p Linguist::Blob.new("", "int main(){}").language'
#<Linguist::Language name=XC>

(I don't know what XC is, but adding some other tokens to the string such as #include <stdio.h> or int argc, char* argv[] gives C. I'm sure most of your samples will have more meat to analyze.)
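The effect of adding more tokens can be illustrated with a toy token-frequency classifier. Everything here is an assumption made for the example: the token counts are invented, the tokenizer is a one-line regex, and the scoring is plain naive Bayes with add-one smoothing, which may differ from what Linguist does internally.

```ruby
# Invented token counts per language, for illustration only;
# these are NOT taken from Linguist's trained data.
TOKENS = {
  "C"  => { "int" => 40, "main" => 10, "#include" => 30, "char" => 20 },
  "XC" => { "int" => 45, "main" => 12, "par" => 25, "chan" => 18 }
}.freeze

VOCAB_SIZE = TOKENS.values.flat_map(&:keys).uniq.size

# Very rough tokenizer: words, optionally prefixed with '#'.
def tokenize(text)
  text.scan(/#?\w+/)
end

# Log-probability of the tokens under one language (add-one smoothing).
def score(tokens, lang)
  counts = TOKENS.fetch(lang)
  total = counts.values.sum.to_f
  tokens.sum { |tok| Math.log((counts.fetch(tok, 0) + 1) / (total + VOCAB_SIZE)) }
end

def classify(text)
  tokens = tokenize(text)
  TOKENS.keys.max_by { |lang| score(tokens, lang) }
end

p classify("int main(){}")                      # => "XC"
p classify("#include <stdio.h>\nint main(){}")  # => "C"
```

With only two tokens, the slightly higher invented counts for XC tip the balance; one token that is distinctive for C is enough to reverse it.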

It's a simple fix, and I've submitted a PR for it. You can use my fork of the gem in the meantime if you'd like. Otherwise, we'll need to look into using Linguist::Classifier directly, as you've started exploring, but that has the potential to get messy.

To use my fork, add this entry to your Gemfile, or modify the existing one to match:

gem 'github-linguist',
  require: 'linguist',
  git: 'https://github.com/mwpastore/linguist.git',
  branch: 'fix-no-candidates'

I'll try to come back and update this answer when the PR has been merged and a new version of the gem has been released with the fix. If I have to do any force-pushes to meet the repository guidelines and/or make the maintainers happy, you may have to run bundle update to pick up the changes. Let me know if you have any questions.

edited Sep 9, 2016 at 3:01

answered Sep 9, 2016 at 0:21


11 Comments

Martin Velez Over a year ago

Thank you for the answer. However, as I mention in the question, I do not know the file extension because I do not know the programming language. I am trying to detect the (likely) programming language.

mwp Over a year ago

@MartinVelez Ah, I see that now. My mistake. Let me tinker with it a bit more.

mwp Over a year ago

@MartinVelez I think you found a bug in the gem. :-) I've submitted a PR and documented how you can use my fork. Please let me know your thoughts.

user1934428 Over a year ago

@MartinVelez: The file extension isn't really a reliable source for detecting the programming language. For instance, when writing executable scripts (not libraries) in Perl, Python, Ruby, and shell languages, it is quite common to not use any extension for the file name.

Jörg W Mittag Over a year ago

xC is a language for parallel real-time embedded programming, integrating elements from Occam-π and C. The snippet you used happens to be a complete and valid xC program. It's also a complete and valid Cyclone program. And a complete and valid Objective-C++, Objective-C, C++, and D program. As you said, the longer the program is the more likely it is that the language will be unique. But still: Objective-C is a proper superset of C, thus all C programs are also Objective-C programs, for example.


Amadan · Accepted Answer · 2016-09-09 08:34:34Z

-1

Taking another quick look at Linguist source, Linguist::Language.all seems to be what you're looking for.

EDIT: Tried the Linguist::Language.all myself. The failure is due to yet another bug: some languages seem to have faulty data. For example, this also fails:

candidates = [Linguist::Language['ADA']]

This is apparently because tokens.ADA doesn't exist in lib/linguist/samples.json. It is not the only such language.

To avoid the bug, you can filter the languages:

non_buggy_languages = Linguist::Samples.cache['tokens'].keys
candidates = non_buggy_languages.map { |l| Linguist::Language[l] }
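The failure and the filter can be reproduced without the gem. In this sketch, token_data is a hypothetical stand-in for the trained data, with one language deliberately missing its token table; the names are mine and not part of Linguist.

```ruby
# Hypothetical stand-in for the trained data: "Ada" has no token table.
token_data = { "C" => { "int" => 3 }, "Ruby" => { "puts" => 2 } }
all_languages = ["Ada", "C", "Ruby"]

# Indexing into a missing table fails the same way as the stack trace
# in the question: calling [] on nil raises NoMethodError.
begin
  token_data["Ada"]["int"]
rescue NoMethodError => e
  puts "lookup failed: #{e.class}"
end

# Keep only the languages that actually have token data.
candidates = all_languages.select { |lang| token_data.key?(lang) }
p candidates  # => ["C", "Ruby"]
```

Filtering up front like this avoids the crash, but it does not change the fact that the classifier is then choosing among every remaining language.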

edited Sep 9, 2016 at 8:34

answered Sep 9, 2016 at 1:19


4 Comments

Martin Velez Over a year ago

Thanks! I tried that, and it did not work. See the error it triggered.

pchaigno Over a year ago

Linguist selection strategies (be that the Bayesian classifier or the heuristic rules) weren't built to choose a language among all possible languages. They are only refinement strategies. If you try to use them with all languages as input, you'll most likely end up with very poor results.

Amadan Over a year ago

@pchaigno: So? The question was "How can I avoid having to define a list of candidates?", and the answer answered it (at the time; I haven't looked whether or not it still does). It is quite unnecessary to put a disclaimer "if you take away some of the methods we use to improve the classification accuracy, the classification accuracy might suffer".

pchaigno Over a year ago

I disagree. Someone reading your answer might think that Linguist is not doing that by default because of a bug. And it's actually "if you take away some of the methods we use to improve the classification accuracy, the classification accuracy will suffer".
