Let's say you want to create a search engine for a site like GitHub or Stack Overflow, where the majority of the textual content is actually source code. What would be a good Lucene tokenizer for dealing with documents of this type?
1 Answer
This is what you are looking for; it covers all the steps and more: http://www.opensourceconnections.com/2013/02/18/indexing-stackoverflow-in-solr/
StandardAnalyzer would probably work pretty well, or possibly a custom analyzer like StandardAnalyzer but without the LowerCaseFilter, depending on your needs. Is there some particular feature you are looking for, as far as how you would like code to be tokenized?
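To make the lowercasing trade-off concrete, here is a rough plain-Java sketch. It does not use Lucene at all: the class name CodeTokenDemo and the "letters, digits, underscores" token rule are assumptions made up for illustration, and StandardAnalyzer's real tokenization rules differ. It only shows what is gained and lost when tokens are lowercased.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: NOT Lucene's StandardTokenizer.
class CodeTokenDemo {
    // Assumed token rule: a run of ASCII letters, digits, or underscores.
    private static final Pattern TOKEN = Pattern.compile("[A-Za-z0-9_]+");

    // Splits text into tokens; lowercases each token when 'lowercase' is true.
    static List<String> tokenize(String text, boolean lowercase) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            String t = m.group();
            tokens.add(lowercase ? t.toLowerCase(Locale.ROOT) : t);
        }
        return tokens;
    }

    public static void main(String[] args) {
        String snippet = "HashMap<String, Integer> hashMap = new HashMap<>();";
        // With lowercasing, the class name and the variable name collapse
        // into the same term.
        System.out.println(tokenize(snippet, true));
        // Without lowercasing, HashMap and hashMap stay distinct terms.
        System.out.println(tokenize(snippet, false));
    }
}
```

With lowercasing, a search for hashmap matches both the class and the variable, which helps recall. Without it, a query can tell HashMap apart from hashMap, which tends to matter more in case-sensitive languages. Which behavior is better depends on how you expect people to search.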