A text analyzer has three main components: character filters, a pre-tokenizer, and token filters.
- character filter: Transforms the text before tokenization, e.g. to_lowercase, unicode_normalization.
- pre-tokenizer: Splits the text into tokens, e.g. unicode_segmentation splits text according to Unicode Standard Annex #29.
- token filter: Transforms or removes tokens after tokenization, e.g. stopwords, stemmer.
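
As a rough illustration of how the three stages compose, here is a minimal Python sketch. It is a conceptual model only, not the extension's implementation: the names `analyze`, `word_split`, and `drop_stopwords` are hypothetical, and the `\w+` split is a simplified stand-in for real Unicode segmentation.

```python
import re

def analyze(text, character_filters, pre_tokenizer, token_filters):
    """Run text through the three stages in order."""
    # Stage 1: character filters rewrite the raw string.
    for char_filter in character_filters:
        text = char_filter(text)
    # Stage 2: the pre-tokenizer turns the string into a list of tokens.
    tokens = pre_tokenizer(text)
    # Stage 3: token filters rewrite or drop tokens.
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens

def to_lowercase(text):
    return text.lower()

def word_split(text):
    # Assumption: a simplified stand-in for a pre-tokenizer (not UAX #29).
    return re.findall(r"\w+", text)

def drop_stopwords(stopwords):
    def apply(tokens):
        return [t for t in tokens if t not in stopwords]
    return apply

result = analyze(
    "It is an apple.",
    character_filters=[to_lowercase],
    pre_tokenizer=word_split,
    token_filters=[drop_stopwords({"it", "is", "an"})],
)
print(result)  # ['apple']
```

Note that the order matters: lowercasing happens before tokenization, so the stopword list only needs lowercase entries.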
We support the following character filters:
- to_lowercase: Convert all characters to lowercase.
- unicode_normalization: Normalize the text according to one of the Unicode Normalization Forms (NFC, NFD, NFKC, NFKD).
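
To make the difference between the four normalization forms concrete, the following sketch uses Python's standard `unicodedata` module. It only illustrates the Unicode forms themselves and says nothing about how the extension applies them internally.

```python
import unicodedata

# "ﬁancé" written with the U+FB01 ligature and a precomposed U+00E9.
text = "\ufb01anc\u00e9"

nfc = unicodedata.normalize("NFC", text)    # canonical composition
nfd = unicodedata.normalize("NFD", text)    # canonical decomposition
nfkc = unicodedata.normalize("NFKC", text)  # compatibility composition
nfkd = unicodedata.normalize("NFKD", text)  # compatibility decomposition

# NFC/NFD keep the ligature; NFKC/NFKD expand it to "f" + "i".
# NFD/NFKD split "é" into "e" plus a combining acute accent (U+0301).
for name, value in [("NFC", nfc), ("NFD", nfd), ("NFKC", nfkc), ("NFKD", nfkd)]:
    print(name, len(value), [hex(ord(ch)) for ch in value])
```

The compatibility forms (NFKC, NFKD) are usually the ones that help search, because visually equivalent characters such as ligatures collapse to the same plain letters.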
We support the following pre-tokenizers:
- regex: Generate tokens by matching a regular expression.
- unicode_segmentation: Split the text into tokens according to Unicode Standard Annex #29.
- jieba: Segment Chinese text using the Jieba library.
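
The idea behind a regex pre-tokenizer, where every match of the pattern becomes one token, can be sketched in Python as follows. The helper name `regex_pre_tokenizer` is hypothetical, and Python's `re` syntax is used here as an assumption; the regex dialect accepted by the extension may differ.

```python
import re

def regex_pre_tokenizer(pattern):
    """Build a tokenizer that emits every match of `pattern` as a token."""
    compiled = re.compile(pattern)

    def tokenize(text):
        return [match.group(0) for match in compiled.finditer(text)]

    return tokenize

words = regex_pre_tokenizer(r"\w+")        # letters, digits, underscore
letters = regex_pre_tokenizer(r"[a-z]+")   # lowercase ASCII letters only

sample = "pg_tokenizer v0.1: fast!"
print(words(sample))    # ['pg_tokenizer', 'v0', '1', 'fast']
print(letters(sample))  # ['pg', 'tokenizer', 'v', 'fast']
```

The same input yields different token streams depending on the pattern, which is why the pattern should be chosen with the downstream token filters in mind.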
We support the following token filters:
- skip_non_alphanumeric: Skip tokens that consist entirely of non-alphanumeric characters.
- stemmer: Stem tokens using the Snowball stemmer algorithm.
- stopwords: Filter out tokens that are in the stop words list.
- synonym: Replace tokens with their synonyms.
- pg_dict: Process tokens using a PostgreSQL dictionary. This lets you use PostgreSQL's own dictionaries or dictionaries provided by other extensions.
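
As a small illustration of the skip_non_alphanumeric rule, the sketch below drops a token only when none of its characters is alphanumeric. This reading of the rule is an assumption, and the function name is hypothetical; the extension's exact character classification may differ from Python's `str.isalnum`.

```python
def skip_non_alphanumeric(tokens):
    """Keep tokens that contain at least one alphanumeric character."""
    # Assumption: a token is skipped only if *every* character is
    # non-alphanumeric, so "c++" survives but "--" does not.
    return [t for t in tokens if any(ch.isalnum() for ch in t)]

tokens = ["apple", "--", "c++", "!", "42", "..."]
print(skip_non_alphanumeric(tokens))  # ['apple', 'c++', '42']
```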
Supported languages for the stemmer token filter:
arabic, armenian, basque, catalan, danish, dutch, english_porter, english_porter2, estonian, finnish, french, german, greek, hindi, hungarian, indonesian, irish, italian, lithuanian, nepali, norwegian, portuguese, romanian, russian, serbian, spanish, swedish, tamil, turkish, yiddish
You can customize the stopwords and synonym filters by providing your own dictionary.
-- Create a dictionary for stopwords, each line is a stopword.
SELECT create_stopwords('stop1', $$
it
is
an
$$);
SELECT tokenizer_catalog.create_text_analyzer('test_stopwords', $$
pre_tokenizer = "unicode_segmentation"
[[character_filters]]
to_lowercase = {}
[[token_filters]]
stopwords = "stop1"
$$);
SELECT tokenizer_catalog.apply_text_analyzer('It is an apple.', 'test_stopwords');
----
{apple}

-- Create a dictionary for synonyms, each line is a group of synonyms.
SELECT create_synonym('syn1', $$
pgsql postgres postgresql
index indices
$$);
SELECT tokenizer_catalog.create_text_analyzer('test_synonym', $$
pre_tokenizer = "unicode_segmentation"
[[token_filters]]
synonym = "syn1"
$$);
SELECT tokenizer_catalog.apply_text_analyzer('postgresql indices', 'test_synonym');
----
{pgsql,index}
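
The two dictionary formats above can be modeled with a short Python sketch. It assumes that every word on a synonym line is replaced by the first word of that line, which is consistent with the `{pgsql,index}` output shown above but is not confirmed as the general rule. The function names `parse_stopwords` and `parse_synonyms` are hypothetical.

```python
def parse_stopwords(body):
    """One stopword per non-empty line."""
    return {line.strip() for line in body.splitlines() if line.strip()}

def parse_synonyms(body):
    """One synonym group per non-empty line, words separated by whitespace."""
    mapping = {}
    for line in body.splitlines():
        words = line.split()
        if not words:
            continue
        # Assumption: the first word on the line is the canonical form.
        canonical = words[0]
        for word in words:
            mapping[word] = canonical
    return mapping

stop1 = parse_stopwords("\nit\nis\nan\n")
syn1 = parse_synonyms("\npgsql postgres postgresql\nindex indices\n")

kept = [t for t in ["it", "is", "an", "apple"] if t not in stop1]
replaced = [syn1.get(t, t) for t in ["postgresql", "indices"]]
print(kept)      # ['apple']
print(replaced)  # ['pgsql', 'index']
```

Tokens that do not appear in any synonym group pass through unchanged, which is what the `syn1.get(t, t)` fallback models.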