Chapter 09 - Building a Custom Analyzer in Elasticsearch

Posted Jun 6, 2020 · 3 min read

My Elasticsearch series is being updated gradually; you are welcome to follow along:
0A. About Elasticsearch and example applications
00. Solr vs. ElasticSearch
01. What can ElasticSearch do?
02. An introduction to Elastic Stack features
03. How to install and set up Elasticsearch
04. Indexing and CRUD operations through the elasticsearch-head plugin
05. Using multiple Elasticsearch instances and the head plugin
06. How does Elasticsearch work when indexing documents?
07. Mapping in Elasticsearch: a concise tutorial
08. Analysis and analyzers in Elasticsearch
09. Building a custom analyzer in Elasticsearch

If you are just getting started with Elasticsearch, I also highly recommend the Elasticsearch Basic Getting Started Tutorial; it is a genuinely practical introductory manual.

Introduction
In the previous post of this stage, I explained the structure and components of an analyzer in general, along with the function of each component. In this post we will look at the implementation side: we will build a custom analyzer, then query with it and examine the differences.
The case for a custom analyzer
Let us consider a case that calls for a custom analyzer. Suppose the text we feed into Elasticsearch contains the following:

1. HTML tags
HTML tags may appear in our text at indexing time, but in most cases they are not needed, so we want to strip them out.
2. Stop words
Words such as "the", "and", and "or" carry little meaning when searching content; they are generally called stop words.
3. Capital letters
These should be normalized to lowercase so that searches are case-insensitive.
4. Short forms such as H2O, $, and %
In some cases, short forms like these should be replaced with the full English word.

Apply a custom analyzer
Consider the sample text below; the table after it lists the actions that need to be performed and the corresponding components of the custom analyzer.

Arun has 100 $ which accounts to 3 % of the total <h2> money </h2>
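Based on the components defined in the next section, the required actions map to analyzer components as follows:

Action to perform                      Custom analyzer component
Replace $ and % with English words     char_filter (type: mapping)
Strip HTML tags                        char_filter (type: html_strip)
Split the text into tokens             tokenizer (standard)
Remove stop words                      filter (type: stop)
Lowercase all tokens                   filter (lowercase)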

The hierarchy in "settings" is as follows:

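Reconstructed from the full request in the next section, it looks like this:

settings
└── analysis
    ├── char_filter   (custom character filters: substitute, html_strip)
    ├── filter        (custom token filters: stopwords_removal)
    └── analyzer      (custom analyzers that combine char filters, a tokenizer, and token filters)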

Apply all components
Now apply all the above components to create a custom analyzer as follows:

curl -XPUT "localhost:9200/testindex_0204" -H 'Content-Type: application/json' -d'{
  "settings":{
    "analysis":{
      "char_filter":{
        "subsitute":{
          "type":"mapping",
          "mappings":[
            "$=> dollar",
            "%=> percentage"
         ]
        },
        "html-strip":{
          "type":"html_strip"
        }
      },
      "tokenizer":"standard",
      "filter":{
        "stopwords_removal":{
          "type":"stop",
          "stopwords":[
            "has",
            "which",
            "to",
            "of",
            "the"
         ]
        }
      },
      "analyzer":{
        "custom_analyzer_type_01":{
          "type":"custom",
          "char_filter":[
            "subsitute",
            "html_strip"
         ],
          "tokenizer":"standard",
          "filter":[
            "stopwords_removal",
            "lowercase"
         ]
        }
      }
    }
  },
  "mappings":{
    "test_type":{
      "properties":{
        "text":{
          "type":"string",
          "analyzer":"custom_analyzer_type_01"
        }
      }
    }
  }
}'
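A note on Elasticsearch versions: the request above follows the older syntax used throughout this series. On 5.x and later the "string" field type is replaced by "text", 6.x makes the Content-Type header mandatory, and 7.x removes mapping types such as "test_type" altogether. Under those assumptions, the "mappings" section on a 7.x cluster would look roughly like this:

"mappings":{
  "properties":{
    "text":{
      "type":"text",
      "analyzer":"custom_analyzer_type_01"
    }
  }
}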

This will create an index that uses a custom analyzer called "custom_analyzer_type_01".
The following diagram illustrates each part of this request in detail:
[Figure: annotated breakdown of the index creation request]
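You can also verify what was registered by retrieving the index settings. This quick check is not from the original post, but it uses the standard settings endpoint:

curl -XGET "localhost:9200/testindex_0204/_settings?pretty"

The response echoes the analysis section back, so you can confirm that the char filters, token filters, and the analyzer were created as intended.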
Use a custom analyzer to generate tokens
Using the _analyze API, you can see the tokens generated by this analyzer as follows:

curl -XGET "localhost:9200/testindex_0204/_analyze?analyzer=custom_analyzer_type_01&pretty=true" -d'Arun has 100 $which accounts to 3%of the total <h2> money </h2>'

The list of tokens is as follows:
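Working the sample text through the pipeline by hand (char filters first, then the standard tokenizer, then the stop and lowercase filters), the expected tokens are roughly:

1. arun
2. 100
3. dollar
4. accounts
5. 3
6. percentage
7. total
8. money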
Here you can make some observations:
Token numbers 3 and 6 were originally "$" and "%", but have been replaced with "dollar" and "percentage" by the mapping char_filter specified in the settings.
The HTML tags <h2> and </h2> have also been removed from the token list, by the html_strip char_filter.
The stop words listed in the filter ("has", "which", "to", "of", "the") have been removed from the token list. Token number 1 would otherwise read "Arun", but the lowercase filter has been applied, so it appears as "arun".

Conclusion
In this post we saw how to build a custom analyzer and apply it to a field in Elasticsearch. This concludes the second stage of the series (indexing, mapping, and analysis). This stage covers one of the fundamental parts of understanding Elasticsearch, and we will draw on it for many purposes later. Starting from stage 03, I will introduce you to the world of the query DSL in Elasticsearch.