Shingle token filter
Add shingles, or word n-grams, to a token stream by concatenating adjacent tokens. By default, the shingle token filter outputs two-word shingles and unigrams.
For example, many tokenizers convert the lazy dog to [ the, lazy, dog ]. You can use the shingle filter to add two-word shingles to this stream: [ the, the lazy, lazy, lazy dog, dog ].
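To make the transformation concrete, here is a minimal Python sketch of the default behavior described above: each token is followed by the two-word shingle that starts at it. The function name `add_two_word_shingles` is hypothetical and the code only illustrates the idea; it is not Lucene's implementation.

```python
def add_two_word_shingles(tokens):
    """Interleave each token with the two-word shingle that starts at it."""
    output = []
    for i, token in enumerate(tokens):
        output.append(token)  # keep the original unigram
        if i + 1 < len(tokens):
            # join this token and its right-hand neighbor with a space
            output.append(f"{token} {tokens[i + 1]}")
    return output


print(add_two_word_shingles(["the", "lazy", "dog"]))
# ['the', 'the lazy', 'lazy', 'lazy dog', 'dog']
```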
Shingles are often used to help speed up phrase queries, such as match_phrase. Rather than creating shingles using the shingle filter, we recommend you use the index_phrases mapping parameter on the appropriate text field instead.
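As a rough illustration of that recommendation, the following Python sketch builds a create index request body that enables `index_phrases` on a text field. The field name `title` is hypothetical, and the sketch only prints the JSON body; sending it (for example with `PUT /my-index-000001`) is left to whatever HTTP client you use.

```python
import json

# Hypothetical field name "title"; index_phrases is set on the text field itself.
create_index_body = {
    "mappings": {
        "properties": {
            "title": {
                "type": "text",
                "index_phrases": True,
            }
        }
    }
}

# Print the JSON that would form the request body.
print(json.dumps(create_index_body, indent=2))
```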
This filter uses Lucene’s ShingleFilter.
The following analyze API request uses the shingle filter to add two-word shingles to the token stream for quick brown fox jumps:
GET /_analyze
{
"tokenizer": "whitespace",
"filter": [ "shingle" ],
"text": "quick brown fox jumps"
}
The filter produces the following tokens:
[ quick, quick brown, brown, brown fox, fox, fox jumps, jumps ]
To produce shingles of 2-3 words, add the following arguments to the analyze API request:
min_shingle_size: 2
max_shingle_size: 3
GET /_analyze
{
"tokenizer": "whitespace",
"filter": [
{
"type": "shingle",
"min_shingle_size": 2,
"max_shingle_size": 3
}
],
"text": "quick brown fox jumps"
}
The filter produces the following tokens:
[ quick, quick brown, quick brown fox, brown, brown fox, brown fox jumps, fox, fox jumps, jumps ]
To only include shingles in the output, add an output_unigrams argument of false to the request.
GET /_analyze
{
"tokenizer": "whitespace",
"filter": [
{
"type": "shingle",
"min_shingle_size": 2,
"max_shingle_size": 3,
"output_unigrams": false
}
],
"text": "quick brown fox jumps"
}
The filter produces the following tokens:
[ quick brown, quick brown fox, brown fox, brown fox jumps, fox jumps ]
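The effect of the size and unigram arguments in the requests above can be approximated in a few lines of Python. This is a sketch under simplifying assumptions (no empty positions, a space as separator), and the function name `shingle` and its keyword arguments are hypothetical stand-ins for the filter's parameters, not an actual client API.

```python
def shingle(tokens, min_size=2, max_size=2, output_unigrams=True):
    """For each start position, emit the unigram, then shingles of increasing size."""
    # Sizes below 2 would only duplicate unigrams, so this sketch rejects them.
    if min_size < 2 or max_size < min_size:
        raise ValueError("expected 2 <= min_size <= max_size")
    output = []
    for start in range(len(tokens)):
        if output_unigrams:
            output.append(tokens[start])
        for size in range(min_size, max_size + 1):
            end = start + size
            if end > len(tokens):
                break  # too few tokens left for this size or any larger one
            output.append(" ".join(tokens[start:end]))
    return output


words = "quick brown fox jumps".split()
print(shingle(words, min_size=2, max_size=3))
print(shingle(words, min_size=2, max_size=3, output_unigrams=False))
```

The first call reproduces the nine tokens listed for the 2-3 word request, and the second reproduces the five shingle-only tokens.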
The following create index API request uses the shingle filter to configure a new custom analyzer.
PUT /my-index-000001
{
"settings": {
"analysis": {
"analyzer": {
"standard_shingle": {
"tokenizer": "standard",
"filter": [ "shingle" ]
}
}
}
}
}
Configurable parameters
max_shingle_size
(Optional, integer) Maximum number of tokens to concatenate when creating shingles. Defaults to 2.
Note: This value cannot be lower than the min_shingle_size argument, which defaults to 2. The difference between this value and the min_shingle_size argument cannot exceed the index.max_shingle_diff index-level setting, which defaults to 3.
min_shingle_size
(Optional, integer) Minimum number of tokens to concatenate when creating shingles. Defaults to 2.
Note: This value cannot exceed the max_shingle_size argument, which defaults to 2. The difference between the max_shingle_size argument and this value cannot exceed the index.max_shingle_diff index-level setting, which defaults to 3.
output_unigrams
(Optional, Boolean) If true, the output includes the original input tokens. If false, the output only includes shingles; the original input tokens are removed. Defaults to true.
output_unigrams_if_no_shingles
(Optional, Boolean) If true, the output includes the original input tokens only if no shingles are produced; if shingles are produced, the output only includes shingles. Defaults to false.
Important: If both this and the output_unigrams parameter are true, only the output_unigrams argument is used.
token_separator
(Optional, string) Separator used to concatenate adjacent tokens to form a shingle. Defaults to a space (" ").
filler_token
(Optional, string) String used in shingles as a replacement for empty positions that do not contain a token. This filler token is only used in shingles, not original unigrams. Defaults to an underscore (_).
Some token filters, such as the stop filter, create empty positions when removing stop words with a position increment greater than one.
Example
In the following analyze API request, the stop filter removes the stop word a from fox jumps a lazy dog, creating an empty position. The subsequent shingle filter replaces this empty position with a plus sign (+) in shingles.
GET /_analyze
{
"tokenizer": "whitespace",
"filter": [
{
"type": "stop",
"stopwords": [ "a" ]
},
{
"type": "shingle",
"filler_token": "+"
}
],
"text": "fox jumps a lazy dog"
}
The filter produces the following tokens:
[ fox, fox jumps, jumps, jumps +, + lazy, lazy, lazy dog, dog ]
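The way the filler token stands in for an empty position can be sketched in Python by representing the token stream as one slot per position, with `None` marking the position left behind by the removed stop word. The function name `shingle_with_gaps` and the slot representation are assumptions made for illustration; the sketch reproduces this example only and does not try to model how Lucene treats consecutive or trailing empty positions.

```python
def shingle_with_gaps(slots, filler_token="_", size=2):
    """slots holds one entry per position; None marks an empty position."""
    output = []
    for start, token in enumerate(slots):
        if token is not None:
            output.append(token)  # unigrams never contain the filler
        window = slots[start:start + size]
        if len(window) == size:
            # replace empty positions with the filler inside the shingle only
            output.append(" ".join(filler_token if t is None else t for t in window))
    return output


# "a" was removed by the stop filter, leaving an empty position
slots = ["fox", "jumps", None, "lazy", "dog"]
print(shingle_with_gaps(slots, filler_token="+"))
# ['fox', 'fox jumps', 'jumps', 'jumps +', '+ lazy', 'lazy', 'lazy dog', 'dog']
```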
To customize the shingle filter, duplicate it to create the basis for a new custom token filter. You can modify the filter using its configurable parameters.
For example, the following create index API request uses a custom shingle filter, my_shingle_filter, to configure a new custom analyzer.
The my_shingle_filter filter uses a min_shingle_size of 2 and a max_shingle_size of 5, meaning it produces shingles of 2-5 words. The filter also includes an output_unigrams argument of false, meaning that only shingles are included in the output.
PUT /my-index-000001
{
"settings": {
"analysis": {
"analyzer": {
"en": {
"tokenizer": "standard",
"filter": [ "my_shingle_filter" ]
}
},
"filter": {
"my_shingle_filter": {
"type": "shingle",
"min_shingle_size": 2,
"max_shingle_size": 5,
"output_unigrams": false
}
}
}
}
}
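The sizes chosen for my_shingle_filter sit exactly at the documented limit: 5 - 2 = 3, which equals the default of index.max_shingle_diff. The following Python sketch checks a pair of sizes against the two constraints stated in the parameter descriptions. The function name `check_shingle_sizes` is hypothetical, and Elasticsearch performs its own validation; this is only a way to reason about a configuration before sending it.

```python
DEFAULT_MAX_SHINGLE_DIFF = 3  # documented default of index.max_shingle_diff


def check_shingle_sizes(min_size=2, max_size=2, max_shingle_diff=DEFAULT_MAX_SHINGLE_DIFF):
    """Return a list of problems; an empty list means the sizes look acceptable."""
    problems = []
    if max_size < min_size:
        problems.append("max_shingle_size cannot be lower than min_shingle_size")
    if max_size - min_size > max_shingle_diff:
        problems.append(
            f"difference {max_size - min_size} exceeds index.max_shingle_diff ({max_shingle_diff})"
        )
    return problems


print(check_shingle_sizes(min_size=2, max_size=5))  # sizes used by my_shingle_filter
print(check_shingle_sizes(min_size=2, max_size=6))  # one step past the default limit
```

Under these assumptions the first call reports no problems, while the second reports that the difference of 4 exceeds the default limit of 3.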