
Standard tokenizer
Loosely speaking, the standard tokenizer splits a stream of characters into tokens at whitespace and punctuation, discarding the punctuation itself. (Under the hood, it follows the Unicode Text Segmentation algorithm, UAX #29.)
The following example shows how the standard tokenizer breaks a character stream into tokens:
POST _analyze
{
  "tokenizer": "standard",
  "text": "Tokenizer breaks characters into tokens!"
}
The preceding command produces the following output; notice the start_offset, end_offset, and position fields in the output:
{
  "tokens": [
    {
      "token": "Tokenizer",
      "start_offset": 0,
      "end_offset": 9,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "breaks",
      "start_offset": 10,
      "end_offset": 16,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "characters",
      "start_offset": 17,
      "end_offset": 27,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "into",
      "start_offset": 28,
      "end_offset": 32,
      "type": "<ALPHANUM>",
      "position": 3
    },
    {
      "token": "tokens",
      "start_offset": 33,
      "end_offset": 39,
      "type": "<ALPHANUM>",
      "position": 4
    }
  ]
}
This token stream can be further processed by the token filters of the analyzer.
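For instance, the _analyze API also accepts a filter parameter, so a lowercase token filter can be chained after the standard tokenizer in the same request:

POST _analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": "Tokenizer breaks characters into tokens!"
}

This produces the same five tokens, except that "Tokenizer" is now emitted as "tokenizer"; the offsets and positions are unchanged, because token filters operate on the token stream rather than on the original character stream.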