
Standard tokenizer
Loosely speaking, the standard tokenizer splits a stream of characters into tokens at whitespace and punctuation, discarding the punctuation itself. (Under the hood, it follows the Unicode Text Segmentation algorithm, UAX #29.)
The following example shows how the standard tokenizer breaks a character stream into tokens:
POST _analyze
{
  "tokenizer": "standard",
  "text": "Tokenizer breaks characters into tokens!"
}
The preceding command produces the following output; notice the start_offset, end_offset, and position fields in the output:
{
  "tokens": [
    {
      "token": "Tokenizer",
      "start_offset": 0,
      "end_offset": 9,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "breaks",
      "start_offset": 10,
      "end_offset": 16,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "characters",
      "start_offset": 17,
      "end_offset": 27,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "into",
      "start_offset": 28,
      "end_offset": 32,
      "type": "<ALPHANUM>",
      "position": 3
    },
    {
      "token": "tokens",
      "start_offset": 33,
      "end_offset": 39,
      "type": "<ALPHANUM>",
      "position": 4
    }
  ]
}
This token stream can be further processed by the token filters of the analyzer.
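For instance, the _analyze API also accepts a filter parameter, so a lowercase token filter can be chained after the standard tokenizer in the same request:

POST _analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": "Tokenizer breaks characters into tokens!"
}

This produces the same five tokens, except that "Tokenizer" is now emitted as "tokenizer"; the offsets and positions are unchanged, because token filters operate on the token stream rather than on the original character stream.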