
Standard tokenizer

Loosely speaking, the standard tokenizer breaks a stream of characters into tokens by splitting on whitespace and punctuation, discarding the punctuation itself. Under the hood, it provides grammar-based tokenization following the Unicode Text Segmentation algorithm (Unicode Standard Annex #29).

The following example shows how the standard tokenizer breaks a character stream into tokens:

POST _analyze
{
  "tokenizer": "standard",
  "text": "Tokenizer breaks characters into tokens!"
}

The preceding command produces the following output; notice the start_offset, end_offset, and position values for each token:

{
  "tokens": [
    {
      "token": "Tokenizer",
      "start_offset": 0,
      "end_offset": 9,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "breaks",
      "start_offset": 10,
      "end_offset": 16,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "characters",
      "start_offset": 17,
      "end_offset": 27,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "into",
      "start_offset": 28,
      "end_offset": 32,
      "type": "<ALPHANUM>",
      "position": 3
    },
    {
      "token": "tokens",
      "start_offset": 33,
      "end_offset": 39,
      "type": "<ALPHANUM>",
      "position": 4
    }
  ]
}
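The standard tokenizer also accepts a max_token_length parameter (255 by default); any token longer than this limit is split at that length. As a minimal sketch of this option, the tokenizer can be defined inline in the same _analyze request, here with the limit lowered to 5 characters:

POST _analyze
{
  "tokenizer": {
    "type": "standard",
    "max_token_length": 5
  },
  "text": "Tokenizer breaks characters into tokens!"
}

With this setting, Tokenizer is emitted as the two tokens Token and izer, characters as chara and cters, and so on.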

This token stream can then be processed further by the analyzer's token filters.
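For example, a custom analyzer can pair the standard tokenizer with the lowercase token filter so that the tokens shown previously are emitted in lowercase. The index and analyzer names below are only illustrative:

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "Tokenizer breaks characters into tokens!"
}

This time, the first token is emitted as tokenizer rather than Tokenizer, while the offsets and positions remain unchanged.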