+ url="http://snowball.tartarus.org">Snowball site for more
+ information). The Snowball project supplies a large number of stemmers for
+ many languages. A Snowball dictionary requires a language parameter to
+ identify which stemmer to use, and optionally can specify a stopword file name.
+ For example, there is a built-in definition equivalent to
+
CREATE TEXT SEARCH DICTIONARY english_stem (
TEMPLATE = snowball, Language = english, StopWords = english
);
-
+
+
+ The
Snowball> dictionary recognizes everything, so it is best
+ to place it at the end of the dictionary stack. It is useless to have it
+ before any other dictionary because a lexeme will never pass through it to
+ the next dictionary.
+
-The
Snowball> dictionary recognizes everything, so it is best
-to place it at the end of the dictionary stack. It it useless to have it
-before any other dictionary because a lexeme will never pass through it to
-the next dictionary.
-
+
-
+
+
Dictionary Testing
->
-
Dictionary Testing
+ The ts_lexize> function facilitates dictionary testing:
-The ts_lexize> function facilitates dictionary testing:
+
-
-
+
-
-
+
+
-
-
-ts_lexize(dict_name text, lexeme text) returns text[]
-
-
+
+
+ ts_lexize(dict_name text, lexeme text) returns text[]
+
+
+
+
+ Returns an array of lexemes if the input lexeme
+ is known to the dictionary dict_name, or an empty
+ array if the lexeme is known to the dictionary but is a stop word, or
+ NULL if it is an unknown word.
+
-
-Returns an array of lexemes if the input lexeme
-is known to the dictionary dictname, or a void
-array if the lexeme is known to the dictionary but it is a stop word, or
-NULL if it is an unknown word.
-
SELECT ts_lexize('english_stem', 'stars');
ts_lexize
-----------
{star}
+
SELECT ts_lexize('english_stem', 'a');
ts_lexize
-----------
{}
-
-
+
+
+
+
+
-
-
+
+ The ts_lexize function expects a
+ lexeme, not text. Below is an example:
-
-The ts_lexize function expects a
-lexeme, not text. Below is an example:
SELECT ts_lexize('thesaurus_astro','supernovae stars') is null;
?column?
----------
t
-The thesaurus dictionary thesaurus_astro does know
-supernovae stars, but ts_lexize> fails since it
-does not parse the input text and considers it as a single lexeme. Use
-plainto_tsquery> and to_tsvector> to test thesaurus
-dictionaries:
+
+ The thesaurus dictionary thesaurus_astro does know
+ supernovae stars, but ts_lexize> fails since it
+ does not parse the input text and considers it as a single lexeme. Use
+ plainto_tsquery> and to_tsvector> to test thesaurus
+ dictionaries:
+
SELECT plainto_tsquery('supernovae stars');
plainto_tsquery
-----------------
'sn'
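+
+ To see the same substitution in a full document-to-tsvector
+ conversion, assuming the active configuration maps word tokens through
+ thesaurus_astro (output is illustrative and depends on your setup):
+
SELECT to_tsvector('supernovae stars');
 to_tsvector
-------------
 'sn':1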
-
-
-
-
-
-
-
Configuration Example
-
-A full text configuration specifies all options necessary to transform a
-document into a tsvector: the parser breaks text into tokens,
-and the dictionaries transform each token into a lexeme. Every call to
-to_tsvector() and to_tsquery()
-needs a configuration to perform its processing. To facilitate management
-of full text searching objects, a set of
SQL commands
-is available, and there are several psql commands which display information
-about full text searching objects ().
-
-
-The configuration parameter
-
-specifies the name of the current default configuration, which is the
-one used by text search functions when an explicit configuration
-parameter is omitted.
-It can be set in postgresql.conf, or set for an
-individual session using the SET> command.
-
-
-Several predefined text searching configurations are available in the
-pg_catalog schema. If you need a custom configuration
-you can create a new text searching configuration and modify it using SQL
-commands.
-
-New text searching objects are created in the current schema by default
-(usually the public schema), but a schema-qualified
-name can be used to create objects in the specified schema.
-
-
-As an example, we will create a configuration
-pg which starts as a duplicate of the
-english> configuration. To be safe, we do this in a transaction:
+
+
+
+
+
+
+
Configuration Example
+
+ A full text configuration specifies all options necessary to transform a
+ document into a tsvector: the parser breaks text into tokens,
+ and the dictionaries transform each token into a lexeme. Every call to
+ to_tsvector() and to_tsquery()
+ needs a configuration to perform its processing. To facilitate management
+ of full text searching objects, a set of
SQL commands
+ is available, and there are several psql commands which display information
+ about full text searching objects ().
+
+
+ The configuration parameter
+
+ specifies the name of the current default configuration, which is the
+ one used by text search functions when an explicit configuration
+ parameter is omitted.
+ It can be set in postgresql.conf, or set for an
+ individual session using the SET> command.
+
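+
+ For example, to make the built-in english> configuration the
+ session default and confirm the setting:
+
SET default_text_search_config = 'pg_catalog.english';
SHOW default_text_search_config;
 default_text_search_config
-----------------------------
 pg_catalog.english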
+
+ Several predefined text searching configurations are available in the
+ pg_catalog schema. If you need a custom configuration
+ you can create a new text searching configuration and modify it using SQL
+ commands.
+
+
+ New text searching objects are created in the current schema by default
+ (usually the public schema), but a schema-qualified
+ name can be used to create objects in the specified schema.
+
+
+ As an example, we will create a configuration
+ pg which starts as a duplicate of the
+ english> configuration. To be safe, we do this in a transaction:
+
BEGIN;
CREATE TEXT SEARCH CONFIGURATION public.pg ( COPY = english );
-
+
+
+ We will use a PostgreSQL-specific synonym list
+ and store it in share/tsearch_data/pg_dict.syn.
+ The file contents look like:
-We will use a PostgreSQL-specific synonym list
-and store it in share/tsearch_data/pg_dict.syn.
-The file contents look like:
postgres pg
pgsql pg
postgresql pg
-We define the dictionary like this:
+ We define the dictionary like this:
+
CREATE TEXT SEARCH DICTIONARY pg_dict (
TEMPLATE = synonym
);
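+
+ As a quick sanity check, ts_lexize> should show the substitution
+ defined in pg_dict.syn (expected output, assuming the synonym
+ file above is installed):
+
SELECT ts_lexize('pg_dict', 'postgresql');
 ts_lexize
-----------
 {pg}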
-
+
-Then register the
ispell> dictionary
-english_ispell using the ispell template:
+
Then register the
ispell> dictionary
+ english_ispell using the ispell template:
CREATE TEXT SEARCH DICTIONARY english_ispell (
StopWords = english
);
-
+
-Now modify mappings for Latin words for configuration pg>:
+ Now modify mappings for Latin words for configuration pg>:
ALTER TEXT SEARCH CONFIGURATION pg
ALTER MAPPING FOR lword, lhword, lpart_hword
WITH pg_dict, english_ispell, english_stem;
-
+
-We do not index or search some tokens:
+ We do not index or search some tokens:
ALTER TEXT SEARCH CONFIGURATION pg
DROP MAPPING FOR email, url, sfloat, uri, float;
-
+
+
+ Now, we can test our configuration:
-Now, we can test our configuration:
SELECT * FROM ts_debug('public.pg', '
PostgreSQL, the highly scalable, SQL compliant, open source object-relational
version of our software: PostgreSQL 8.3.
');
-COMMIT;
+ COMMIT;
-
+
+
+ With the dictionaries and mappings set up, suppose we have a table
+ pgweb which contains 11239 documents from the
+
PostgreSQL web site. Only relevant columns
+ are shown:
-With the dictionaries and mappings set up, suppose we have a table
-pgweb which contains 11239 documents from the
-
PostgreSQL web site. Only relevant columns
-are shown:
=> \d pgweb
Table "public.pgweb"
title | character varying |
dlm | date |
-
+
+
+ The next step is to set the session to use the new configuration, which was
+ created in the public> schema:
-The next step is to set the session to use the new configuration, which was
-created in the public> schema:
=> \dF
-postgres=# \dF public.*
-List of fulltext configurations
- Schema | Name | Description
---------+------+-------------
- public | pg |
+ List of fulltext configurations
+ Schema | Name | Description
+---------+------+-------------
+ public | pg |
SET default_text_search_config = 'public.pg';
SET
----------------------------
public.pg
-
+
+
+
-
+
+
Managing Multiple Configurations
-
-
Managing Multiple Configurations
+ If you are using the same text search configuration for the entire cluster
+ just set the value in postgresql.conf>. If using a single
+ text search configuration for an entire database, use ALTER
+ DATABASE ... SET>.
+
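+
+ For example (mydb> is a placeholder for your database name):
+
ALTER DATABASE mydb SET default_text_search_config = 'public.pg';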
-If you are using the same text search configuration for the entire cluster
-just set the value in postgresql.conf>. If using a single
-text search configuration for an entire database, use ALTER
-DATABASE ... SET>.
-
+ However, if you need to use several text search configurations in the same
+ database you must be careful to reference the proper text search
+ configuration. This can be done by either setting
+ default_text_search_config> in each session or supplying the
+ configuration name in every function call, e.g. to_tsquery('french',
+ 'friend'), to_tsvector('english', col). If you are using an expression
+ index you must embed the configuration name into the expression index, e.g.:
-However, if you need to use several text search configurations in the same
-database you must be careful to reference the proper text search
-configuration. This can be done by either setting
-default_text_search_config> in each session or supplying the
-configuration name in every function call, e.g. to_tsquery('french',
-'friend'), to_tsvector('english', col). If you are using an expression
-index you must embed the configuration name into the expression index, e.g.:
CREATE INDEX pgweb_idx ON pgweb USING gin(to_tsvector('french', title || body));
-And for an expression index, specify the configuration name in the
-WHERE> clause as well so the expression index will be used.
-
-
+ Queries must also specify the same configuration name in the
+ WHERE> clause so the expression index will be used.
+
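+
+ For instance, a query that matches the expression used in the index above
+ (a sketch reusing the pgweb columns from the earlier example):
+
SELECT title
FROM pgweb
WHERE to_tsvector('french', title || body) @@ to_tsquery('french', 'friend');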
+
+
-
+
-
-
GiST and GIN Index Types
+
+
GiST and GIN Index Types
-There are two kinds of indexes which can be used to speed up full text
-operators ().
-Note that indexes are not mandatory for full text searching.
+ There are two kinds of indexes which can be used to speed up full text
+ operators ().
+ Note that indexes are not mandatory for full text searching.
-
+
-
+
-
-GIST, for text searching
-
+
+ GiST, for text searching
+
-
-
-CREATE INDEX name ON table USING gist(column);
-
-
+
+
+ CREATE INDEX name ON table USING gist(column);
+
+
-
-Creates a GiST (Generalized Search Tree)-based index.
-The column can be of tsvector> or
-tsquery> type.
-
+
+ Creates a GiST (Generalized Search Tree)-based index.
+ The column can be of tsvector> or
+ tsquery> type.
+
+
+
-
-
+
-
+
+ GIN
+
-
-GIN
-
+
+
+ CREATE INDEX name ON table USING gin(column);
+
+
-
-
-CREATE INDEX name ON table USING gin(column);
-
-
+
+ Creates a GIN (Generalized Inverted Index)-based index.
+ The column must be of tsvector> type.
+
+
+
-
-Creates a GIN (Generalized Inverted Index)-based index.
-The column must be of tsvector> type.
-
+
+
-
-
+ A GiST index is lossy, meaning it is necessary
+ to check the actual table row to eliminate false matches.
+
PostgreSQL does this automatically; for
+ example, in the query plan below, the Filter:
+ line indicates the index output will be rechecked:
-
-
-
-A GiST index is lossy, meaning it is necessary
-to check the actual table row to eliminate false matches.
-
PostgreSQL does this automatically; for
-example, in the query plan below, the Filter:
-line indicates the index output will be rechecked:
EXPLAIN SELECT * FROM apod WHERE textsearch @@ to_tsquery('supernovae');
QUERY PLAN
Index Cond: (textsearch @@ '''supernova'''::tsquery)
Filter: (textsearch @@ '''supernova'''::tsquery)
-GiST index lossiness happens because each document is represented by a
-fixed-length signature. The signature is generated by hashing (crc32) each
-word into a random bit in an n-bit string and all words combine to produce
-an n-bit document signature. Because of hashing there is a chance that
-some words hash to the same position and could result in a false hit.
-Signatures calculated for each document in a collection are stored in an
-RD-tree (Russian Doll tree), invented by Hellerstein,
-which is an adaptation of R-tree for sets. In our case
-the transitive containment relation is realized by
-superimposed coding (Knuth, 1973) of signatures, i.e., a parent is the
-result of 'OR'-ing the bit-strings of all children. This is a second
-factor of lossiness. It is clear that parents tend to be full of
-1>s (degenerates) and become quite useless because of the
-limited selectivity. Searching is performed as a bit comparison of a
-signature representing the query and an RD-tree entry.
-If all 1>s of both signatures are in the same position we
-say that this branch probably matches the query, but if there is even one
-discrepancy we can definitely reject this branch.
-
-
-Lossiness causes serious performance degradation since random access of
-heap records is slow and limits the usefulness of GiST
-indexes. The likelihood of false hits depends on several factors, like
-the number of unique words, so using dictionaries to reduce this number
-is recommended.
-
-
-Actually, this is not the whole story. GiST indexes have an optimization
-for storing small tsvectors (< TOAST_INDEX_TARGET
-bytes, 512 bytes). On leaf pages small tsvectors are stored unchanged,
-while longer ones are represented by their signatures, which introduces
-some lossiness. Unfortunately, the existing index API does not allow for
-a return value to say whether it found an exact value (tsvector) or whether
-the result needs to be checked. This is why the GiST index is
-currently marked as lossy. We hope to improve this in the future.
-
-
-GIN indexes are not lossy but their performance depends logarithmically on
-the number of unique words.
-
-
-There is one side-effect of the non-lossiness of a GIN index when using
-query labels/weights, like 'supernovae:a'. A GIN index
-has all the information necessary to determine a match, so the heap is
-not accessed. However, label information is not stored in the index,
-so if the query involves label weights it must access
-the heap. Therefore, a special full text search operator @@@
-was created which forces the use of the heap to get information about
-labels. GiST indexes are lossy so it always reads the heap and there is
-no need for a special operator. In the example below,
-fulltext_idx is a GIN index:
+
+ GiST index lossiness happens because each document is represented by a
+ fixed-length signature. The signature is generated by hashing (crc32) each
+ word into a random bit in an n-bit string and all words combine to produce
+ an n-bit document signature. Because of hashing there is a chance that
+ some words hash to the same position and could result in a false hit.
+ Signatures calculated for each document in a collection are stored in an
+ RD-tree (Russian Doll tree), invented by Hellerstein,
+ which is an adaptation of R-tree for sets. In our case
+ the transitive containment relation is realized by
+ superimposed coding (Knuth, 1973) of signatures, i.e., a parent is the
+ result of 'OR'-ing the bit-strings of all children. This is a second
+ factor of lossiness. It is clear that parents tend to be full of
+ 1>s (degenerates) and become quite useless because of the
+ limited selectivity. Searching is performed as a bit comparison of a
+ signature representing the query and an RD-tree entry.
+ If all 1>s of both signatures are in the same position we
+ say that this branch probably matches the query, but if there is even one
+ discrepancy we can definitely reject this branch.
+
+
+ Lossiness causes serious performance degradation since random access of
+ heap records is slow and limits the usefulness of GiST
+ indexes. The likelihood of false hits depends on several factors, like
+ the number of unique words, so using dictionaries to reduce this number
+ is recommended.
+
+
+ Actually, this is not the whole story. GiST indexes have an optimization
+ for storing small tsvectors (< TOAST_INDEX_TARGET
+ bytes, 512 bytes). On leaf pages small tsvectors are stored unchanged,
+ while longer ones are represented by their signatures, which introduces
+ some lossiness. Unfortunately, the existing index API does not allow for
+ a return value to say whether it found an exact value (tsvector) or whether
+ the result needs to be checked. This is why the GiST index is
+ currently marked as lossy. We hope to improve this in the future.
+
+
+ GIN indexes are not lossy but their performance depends logarithmically on
+ the number of unique words.
+
+
+ There is one side-effect of the non-lossiness of a GIN index when using
+ query labels/weights, like 'supernovae:a'. A GIN index
+ has all the information necessary to determine a match, so the heap is
+ not accessed. However, label information is not stored in the index,
+ so if the query involves label weights it must access
+ the heap. Therefore, a special full text search operator @@@
+ was created which forces the use of the heap to get information about
+ labels. GiST indexes are lossy so they always read the heap and there is
+ no need for a special operator. In the example below,
+ fulltext_idx is a GIN index:
+
EXPLAIN SELECT * FROM apod WHERE textsearch @@@ to_tsquery('supernovae:a');
QUERY PLAN
Filter: (textsearch @@@ '''supernova'':A'::tsquery)
-
-
-In choosing which index type to use, GiST or GIN, consider these differences:
-
-GiN index lookups are three times faster than GiST
-
-GiN indexes take three times longer to build than GiST
-
-GiN is about ten times slower to update than GiST
-
-GiN indexes are two-to-three times larger than GiST
-
-
-
-
-In summary,
GIN indexes are best for static data because
-the indexes are faster for lookups. For dynamic data, GiST indexes are
-faster to update. Specifically,
GiST indexes are very
-good for dynamic data and fast if the number of unique words (lexemes) is
-under 100,000, while
GIN handles +100,000 lexemes better
-but is slower to update.
-
-
-Partitioning of big collections and the proper use of GiST and GIN indexes
-allows the implementation of very fast searches with online update.
-Partitioning can be done at the database level using table inheritance
-and constraint_exclusion>, or distributing documents over
-servers and collecting search results using the contrib/dblink>
-extension module. The latter is possible because ranking functions use
-only local information.
-
-
-
-
-
-
Limitations
-
-The current limitations of Full Text Searching are:
-
-
The length of each lexeme must be less than 2K bytes
-
The length of a tsvector (lexemes + positions) must be less than 1 megabyte
-
The number of lexemes must be less than 264
-
Positional information must be non-negative and less than 16,383
-
No more than 256 positions per lexeme
-
The number of nodes (lexemes + operations) in tsquery must be less than 32,768
-
-
-
-For comparison, the
PostgreSQL 8.1 documentation
-contained 10,441 unique words, a total of 335,420 words, and the most frequent
-word postgresql> was mentioned 6,127 times in 655 documents.
-
-
-
-Another example — the
PostgreSQL mailing list
-archives contained 910,989 unique words with 57,491,343 lexemes in 461,020
-messages.
-
-
-
-
-
-
-Information about full text searching objects can be obtained
-in psql using a set of commands:
-
-\dF{,d,p}+ PATTERN
-
-An optional + produces more details.
-
-The optional parameter PATTERN should be the name of
-a full text searching object, optionally schema-qualified. If
-PATTERN is not specified then information about all
-visible objects will be displayed. PATTERN can be a
-regular expression and can apply separately to schema
-names and object names. The following examples illustrate this:
+
+
+ In choosing which index type to use, GiST or GIN, consider these differences:
+
+
+ GIN index lookups are three times faster than GiST
+
+
+
+ GIN indexes take three times longer to build than GiST
+
+
+
+ GIN is about ten times slower to update than GiST
+
+
+
+ GIN indexes are two-to-three times larger than GiST
+
+
+
+
+
+ In summary,
GIN indexes are best for static data because
+ the indexes are faster for lookups. For dynamic data, GiST indexes are
+ faster to update. Specifically,
GiST indexes are very
+ good for dynamic data and fast if the number of unique words (lexemes) is
+ under 100,000, while
GIN handles more than 100,000 lexemes better
+ but is slower to update.
+
+
+ Partitioning of big collections and the proper use of GiST and GIN indexes
+ allow the implementation of very fast searches with online update.
+ Partitioning can be done at the database level using table inheritance
+ and constraint_exclusion>, or distributing documents over
+ servers and collecting search results using the contrib/dblink>
+ extension module. The latter is possible because ranking functions use
+ only local information.
+
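+
+ A minimal inheritance-based sketch (table and column names are
+ illustrative; constraint_exclusion> lets the planner skip child
+ tables whose CHECK> constraints exclude the queried date range):
+
CREATE TABLE docs (title text, body text, dlm date, textsearch tsvector);
CREATE TABLE docs_2007 (CHECK (dlm >= '2007-01-01' AND dlm < '2008-01-01')) INHERITS (docs);
CREATE INDEX docs_2007_fts_idx ON docs_2007 USING gin(textsearch);

SET constraint_exclusion = on;
SELECT title FROM docs
WHERE textsearch @@ to_tsquery('supernovae') AND dlm >= '2007-01-01' AND dlm < '2008-01-01';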
+
+
+
+
+
Limitations
+
+ The current limitations of Full Text Searching are:
+
+
+
The length of each lexeme must be less than 2K bytes
+
+
+
The length of a tsvector (lexemes + positions) must be less than 1 megabyte
+
+
+
The number of lexemes must be less than 2^64
+
+
+
Positional information must be non-negative and less than 16,383
+
+
+
No more than 256 positions per lexeme
+
+
+
The number of nodes (lexemes + operations) in tsquery must be less than 32,768
+
+
+
+
+ For comparison, the
PostgreSQL 8.1 documentation
+ contained 10,441 unique words, a total of 335,420 words, and the most frequent
+ word postgresql> was mentioned 6,127 times in 655 documents.
+
+
+
+ Another example — the
PostgreSQL mailing list
+ archives contained 910,989 unique words with 57,491,343 lexemes in 461,020
+ messages.
+
+
+
+
+
+
+ Information about full text searching objects can be obtained
+ in psql using a set of commands:
+
+ \dF{,d,p}+ PATTERN
+
+ An optional + produces more details.
+
+
+ The optional parameter PATTERN should be the name of
+ a full text searching object, optionally schema-qualified. If
+ PATTERN is not specified then information about all
+ visible objects will be displayed. PATTERN can be a
+ regular expression and can apply separately to schema
+ names and object names. The following examples illustrate this:
+
=> \dF *fulltext*
List of fulltext configurations
fulltext | fulltext_cfg |
public | fulltext_cfg |
-
+
-
+
-\dF[+] [PATTERN]
+ \dF[+] [PATTERN]
- List full text searching configurations (add "+" for more detail)
+ List full text searching configurations (add "+" for more detail)
By default (without PATTERN), information about
all visible full text configurations will be
displayed.
+
=> \dF russian
List of fulltext configurations
pg_catalog | russian | default configuration for Russian
=> \dF+ russian
-Configuration "pg_catalog.russian"
-Parser name: "pg_catalog.default"
+ Configuration "pg_catalog.russian"
+ Parser name: "pg_catalog.default"
Token | Dictionaries
--------------+-------------------------
email | pg_catalog.simple
version | pg_catalog.simple
word | pg_catalog.russian_stem
-
+
-\dFd[+] [PATTERN]
+ \dFd[+] [PATTERN]
- List full text dictionaries (add "+" for more detail).
+ List full text dictionaries (add "+" for more detail).
By default (without PATTERN), information about
all visible dictionaries will be displayed.
+
=> \dFd
List of fulltext dictionaries
pg_catalog | swedish | Snowball stemmer for swedish language
pg_catalog | turkish | Snowball stemmer for turkish language
-
+
-\dFp[+] [PATTERN]
+ \dFp[+] [PATTERN]
- List full text parsers (add "+" for more detail)
+ List full text parsers (add "+" for more detail)
By default (without PATTERN), information about
all visible full text parsers will be displayed.
-=> \dFp
+ => \dFp
List of fulltext parsers
Schema | Name | Description
------------+---------+---------------------
pg_catalog | default | default word parser
-(1 row)
+ (1 row)
=> \dFp+
Fulltext parser "pg_catalog.default"
Method | Function | Description
word | Word
(23 rows)
-
+
-
-
+
-
-
Debugging
+
+
Debugging
-Function ts_debug allows easy testing of your full text searching
-configuration.
-
+ Function ts_debug allows easy testing of your full text searching
+ configuration.
+
-
-ts_debug(config_name, document TEXT) returns SETOF ts_debug
-
+
+ ts_debug(config_name, document TEXT) returns SETOF ts_debug
+
+
+ ts_debug> displays information about every token of
+ document as produced by the
+ parser and processed by the configured dictionaries using the configuration
+ specified by config_name.
+
+
+ The ts_debug type is defined as:
-ts_debug> displays information about every token of
-document as produced by the
-parser and processed by the configured dictionaries using the configuration
-specified by config_name.
-
-ts_debug type defined as:
CREATE TYPE ts_debug AS (
"Alias" text,
"Lexized token" text
);
-
+
+
+ For a demonstration of how function ts_debug works we
+ first create a public.english configuration and
+ ispell dictionary for the English language. You can skip the test step and
+ play with the standard english configuration.
+
-For a demonstration of how function ts_debug works we
-first create a public.english configuration and
-ispell dictionary for the English language. You can skip the test step and
-play with the standard english configuration.
-
CREATE TEXT SEARCH CONFIGURATION public.english ( COPY = pg_catalog.english );
);
ALTER TEXT SEARCH CONFIGURATION public.english
- ALTER MAPPING FOR lword WITH english_ispell, english_stem;
+ ALTER MAPPING FOR lword WITH english_ispell, english_stem;
lword | Latin word | supernovaes | {public.english_ispell,pg_catalog.english_stem} | pg_catalog.english_stem: {supernova}
(5 rows)
-In this example, the word Brightest> was recognized by a
-parser as a Latin word (alias lword)
-and came through the dictionaries public.english_ispell> and
-pg_catalog.english_stem. It was recognized by
-public.english_ispell, which reduced it to the noun
-bright. The word supernovaes is unknown
-by the public.english_ispell dictionary so it was passed to
-the next dictionary, and, fortunately, was recognized (in fact,
-public.english_stem is a stemming dictionary and recognizes
-everything; that is why it was placed at the end of the dictionary stack).
-
-
-The word The was recognized by public.english_ispell
-dictionary as a stop word () and will not be indexed.
-
-
-You can always explicitly specify which columns you want to see:
+
+ In this example, the word Brightest> was recognized by a
+ parser as a Latin word (alias lword)
+ and came through the dictionaries public.english_ispell> and
+ pg_catalog.english_stem. It was recognized by
+ public.english_ispell, which reduced it to the word
+ bright. The word supernovaes is unknown
+ by the public.english_ispell dictionary so it was passed to
+ the next dictionary, and, fortunately, was recognized (in fact,
+ public.english_stem is a stemming dictionary and recognizes
+ everything; that is why it was placed at the end of the dictionary stack).
+
+
+ The word The was recognized by the public.english_ispell
+ dictionary as a stop word () and will not be indexed.
+
+
+ You can always explicitly specify which columns you want to see:
+
SELECT "Alias", "Token", "Lexized token"
FROM ts_debug('public.english','The Brightest supernovaes');
lword | supernovaes | pg_catalog.english_stem: {supernova}
(5 rows)
-
-
-
-
-
-
Example of Creating a Rule-Based Dictionary
-
-The motivation for this example dictionary is to control the indexing of
-integers (signed and unsigned), and, consequently, to minimize the number
-of unique words which greatly affects to performance of searching.
-
-
-The dictionary accepts two options:
-
-
-The MAXLEN parameter specifies the maximum length of the
-number considered as a 'good' integer. The default value is 6.
-
-
-The REJECTLONG parameter specifies if a 'long' integer
-should be indexed or treated as a stop word. If
-REJECTLONG=FALSE (default),
-the dictionary returns the prefixed part of the integer with length
-MAXLEN. If
-REJECTLONG=TRUE, the dictionary
-considers a long integer as a stop word.
-
-
-
-
-
-
-A similar idea can be applied to the indexing of decimal numbers, for
-example, in the DecDict dictionary. The dictionary
-accepts two options: the MAXLENFRAC parameter specifies
-the maximum length of the fractional part considered as a 'good' decimal.
-The default value is 3. The REJECTLONG parameter
-controls whether a decimal number with a 'long' fractional part should be indexed
-or treated as a stop word. If
-REJECTLONG=FALSE (default),
-the dictionary returns the decimal number with the length of its fraction part
-truncated to MAXLEN. If
-REJECTLONG=TRUE, the dictionary
-considers the number as a stop word. Notice that
-REJECTLONG=FALSE allows the indexing
-of 'shortened' numbers and search results will contain documents with
-shortened numbers.
-
-
-
-Examples:
+
+
+
+
+
+
Example of Creating a Rule-Based Dictionary
+
+ The motivation for this example dictionary is to control the indexing of
+ integers (signed and unsigned), and, consequently, to minimize the number
+ of unique words, which greatly affects the performance of searching.
+
+
+ The dictionary accepts two options:
+
+
+
+ The MAXLEN parameter specifies the maximum length of the
+ number considered as a 'good' integer. The default value is 6.
+
+
+
+
+ The REJECTLONG parameter specifies if a 'long' integer
+ should be indexed or treated as a stop word. If
+ REJECTLONG=FALSE (default),
+ the dictionary returns the prefixed part of the integer with length
+ MAXLEN. If
+ REJECTLONG=TRUE, the dictionary
+ considers a long integer as a stop word.
+
+
+
+
+
+
+
+ A similar idea can be applied to the indexing of decimal numbers, for
+ example, in the DecDict dictionary. The dictionary
+ accepts two options: the MAXLENFRAC parameter specifies
+ the maximum length of the fractional part considered as a 'good' decimal.
+ The default value is 3. The REJECTLONG parameter
+ controls whether a decimal number with a 'long' fractional part should be indexed
+ or treated as a stop word. If
+ REJECTLONG=FALSE (default),
+ the dictionary returns the decimal number with the length of its fraction part
+ truncated to MAXLENFRAC. If
+ REJECTLONG=TRUE, the dictionary
+ considers the number as a stop word. Notice that
+ REJECTLONG=FALSE allows the indexing
+ of 'shortened' numbers and search results will contain documents with
+ shortened numbers.
+
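+
+ A hypothetical test of such a dictionary, assuming it has been installed
+ under the name decdict with the default options
+ (MAXLENFRAC = 3, REJECTLONG = FALSE), might look like:
+
SELECT ts_lexize('decdict', '12.34567');
 ts_lexize
-----------
 {12.345}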
+
+ Examples:
+
SELECT ts_lexize('intdict', 11234567890);
ts_lexize
-----------
{112345}
-
-Now, we want to ignore long integers:
+
+
+ Now, we want to ignore long integers:
+
ALTER TEXT SEARCH DICTIONARY intdict (
MAXLEN = 6, REJECTLONG = TRUE
);
+
SELECT ts_lexize('intdict', 11234567890);
ts_lexize
-----------
{}
-
+
+
+ Create contrib/dict_intdict> directory with files
+ dict_tmpl.c>, Makefile>, dict_intdict.sql.in>:
-Create contrib/dict_intdict> directory with files
-dict_tmpl.c>, Makefile>, dict_intdict.sql.in>:
-make && make install
-psql DBNAME < dict_intdict.sql
+$ make && make install
+$ psql DBNAME < dict_intdict.sql
-
+
-This is a dict_tmpl.c> file:
-
+ This is a dict_tmpl.c> file:
+
#include "postgres.h"
#include "utils/ts_public.h"
#include "utils/ts_utils.h"
- typedef struct {
- int maxlen;
- bool rejectlong;
- } DictInt;
+typedef struct {
+ int maxlen;
+ bool rejectlong;
+} DictInt;
- PG_FUNCTION_INFO_V1(dinit_intdict);
- Datum dinit_intdict(PG_FUNCTION_ARGS);
+PG_FUNCTION_INFO_V1(dinit_intdict);
+Datum dinit_intdict(PG_FUNCTION_ARGS);
- Datum
- dinit_intdict(PG_FUNCTION_ARGS) {
- DictInt *d = (DictInt*)malloc( sizeof(DictInt) );
- Map *cfg, *pcfg;
- text *in;
+Datum
+dinit_intdict(PG_FUNCTION_ARGS) {
+ DictInt *d = (DictInt*)malloc( sizeof(DictInt) );
+ Map *cfg, *pcfg;
+ text *in;
- if (!d)
- elog(ERROR, "No memory");
- memset(d, 0, sizeof(DictInt));
+ if (!d)
+ elog(ERROR, "No memory");
+ memset(d, 0, sizeof(DictInt));
- /* Your INIT code */
-/* defaults */
- d->maxlen = 6;
- d->rejectlong = false;
+ /* Your INIT code */
+ /* defaults */
+ d->maxlen = 6;
+ d->rejectlong = false;
- if ( PG_ARGISNULL(0) || PG_GETARG_POINTER(0) == NULL ) /* no options */
- PG_RETURN_POINTER(d);
+ if (PG_ARGISNULL(0) || PG_GETARG_POINTER(0) == NULL) /* no options */
+ PG_RETURN_POINTER(d);
- in = PG_GETARG_TEXT_P(0);
- parse_keyvalpairs(in, &cfg);
- PG_FREE_IF_COPY(in, 0);
- pcfg=cfg;
+ in = PG_GETARG_TEXT_P(0);
+ parse_keyvalpairs(in, &cfg);
+ PG_FREE_IF_COPY(in, 0);
+ pcfg=cfg;
- while (pcfg->key)
+ while (pcfg->key)
+ {
+ if (strcasecmp("MAXLEN", pcfg->key) == 0)
+ d->maxlen=atoi(pcfg->value);
+ else if ( strcasecmp("REJECTLONG", pcfg->key) == 0)
{
- if (strcasecmp("MAXLEN", pcfg->key) == 0)
- d->maxlen=atoi(pcfg->value);
- else if ( strcasecmp("REJECTLONG", pcfg->key) == 0)
- {
- if ( strcasecmp("true", pcfg->value) == 0 )
- d->rejectlong=true;
- else if ( strcasecmp("false", pcfg->value) == 0)
- d->rejectlong=false;
- else
- elog(ERROR,"Unknown value: %s => %s", pcfg->key, pcfg->value);
- }
- else
- elog(ERROR,"Unknown option: %s => %s", pcfg->key, pcfg->value);
-
- pfree(pcfg->key);
- pfree(pcfg->value);
- pcfg++;
+ if ( strcasecmp("true", pcfg->value) == 0 )
+ d->rejectlong=true;
+ else if ( strcasecmp("false", pcfg->value) == 0)
+ d->rejectlong=false;
+ else
+ elog(ERROR,"Unknown value: %s => %s", pcfg->key, pcfg->value);
}
- pfree(cfg);
+ else
+ elog(ERROR,"Unknown option: %s => %s", pcfg->key, pcfg->value);
- PG_RETURN_POINTER(d);
+ pfree(pcfg->key);
+ pfree(pcfg->value);
+ pcfg++;
+ }
+ pfree(cfg);
+
+ PG_RETURN_POINTER(d);
}
PG_FUNCTION_INFO_V1(dlexize_intdict);
Datum
dlexize_intdict(PG_FUNCTION_ARGS)
{
- DictInt *d = (DictInt*)PG_GETARG_POINTER(0);
- char *in = (char*)PG_GETARG_POINTER(1);
- char *txt = pnstrdup(in, PG_GETARG_INT32(2));
- TSLexeme *res = palloc(sizeof(TSLexeme) * 2);
+ DictInt *d = (DictInt*)PG_GETARG_POINTER(0);
+ char *in = (char*)PG_GETARG_POINTER(1);
+ char *txt = pnstrdup(in, PG_GETARG_INT32(2));
+ TSLexeme *res = palloc(sizeof(TSLexeme) * 2);
- /* Your INIT dictionary code */
- res[1].lexeme = NULL;
+ /* Your INIT dictionary code */
+ res[1].lexeme = NULL;
- if (PG_GETARG_INT32(2) > d->maxlen)
- {
- if (d->rejectlong)
- { /* stop, return void array */
- pfree(txt);
- res[0].lexeme = NULL;
- }
- else
- { /* cut integer */
- txt[d->maxlen] = '\0';
- res[0].lexeme = txt;
- }
+ if (PG_GETARG_INT32(2) > d->maxlen)
+ {
+ if (d->rejectlong)
+ { /* stop, return void array */
+ pfree(txt);
+ res[0].lexeme = NULL;
}
else
- res[0].lexeme = txt;
+ { /* cut integer */
+ txt[d->maxlen] = '\0';
+ res[0].lexeme = txt;
+ }
+ }
+ else
+ res[0].lexeme = txt;
- PG_RETURN_POINTER(res);
+ PG_RETURN_POINTER(res);
}
-This is the Makefile:
+ This is the Makefile:
+
subdir = contrib/dict_intdict
top_builddir = ../..
include $(top_srcdir)/contrib/contrib-global.mk
-
+
+
+ This is a dict_intdict.sql.in:
-This is a dict_intdict.sql.in:
SET default_text_search_config = 'english';
BEGIN;
CREATE OR REPLACE FUNCTION dinit_intdict(internal)
-RETURNS internal
-AS 'MODULE_PATHNAME'
-LANGUAGE 'C';
+ RETURNS internal
+ AS 'MODULE_PATHNAME'
+ LANGUAGE 'C';
CREATE OR REPLACE FUNCTION dlexize_intdict(internal,internal,internal,internal)
-RETURNS internal
-AS 'MODULE_PATHNAME'
-LANGUAGE 'C'
-WITH (isstrict);
+ RETURNS internal
+ AS 'MODULE_PATHNAME'
+ LANGUAGE 'C'
+ WITH (isstrict);
CREATE TEXT SEARCH TEMPLATE intdict_template (
LEXIZE = dlexize_intdict, INIT = dinit_intdict
);
CREATE TEXT SEARCH DICTIONARY intdict (
- TEMPLATE = intdict_template,
- MAXLEN = 6, REJECTLONG = false
+ TEMPLATE = intdict_template,
+ MAXLEN = 6, REJECTLONG = false
);
COMMENT ON TEXT SEARCH DICTIONARY intdict IS 'Dictionary for Integers';
END;
-
-
-
-
-
-
Example of Creating a Parser
-
-
SQL command
CREATE TEXT SEARCH PARSER creates
-a parser for full text searching. In our example we will implement
-a simple parser which recognizes space-delimited words and
-has only two types (3, word, Word; 12, blank, Space symbols). Identifiers
-were chosen to keep compatibility with the default headline() function
-since we do not implement our own version.
-
-
-To implement a parser one needs to create a minimum of four functions.
-
-
-
-
-
-
-
-START = start_function
-
-
-
-Initialize the parser. Arguments are a pointer to the parsed text and its
-length.
-
-Returns a pointer to the internal structure of a parser. Note that it should
-be malloc>ed or palloc>ed in the
-TopMemoryContext>. We name it ParserState>.
-
-
-
-
-
-
-
-GETTOKEN = gettoken_function
-
-
-
-Returns the next token.
-Arguments are ParserState *, char **, int *.
-
-This procedure will be called as long as the procedure returns token type zero.
-
-
-
-
-
-
-
-END = end_function,
-
-
-
-This void function will be called after parsing is finished to free
-allocated resources in this procedure (ParserState>). The argument
-is ParserState *.
-
-
-
-
-
-
-
-LEXTYPES = lextypes_function
-
-
-
-Returns an array containing the id, alias, and the description of the tokens
-in the parser. See LexDescr in src/include/utils/ts_public.h>.
-
-
-
-
-
-
-Below is the source code of our test parser, organized as a contrib> module.
-
-
-Testing:
+
+
+
+
+
+
Example of Creating a Parser
+
+
SQL command
CREATE TEXT SEARCH PARSER creates
+ a parser for full text searching. In our example we will implement
+ a simple parser which recognizes space-delimited words and
+ has only two types (3, word, Word; 12, blank, Space symbols). Identifiers
+ were chosen to keep compatibility with the default headline() function
+ since we do not implement our own version.
+
+
+ To implement a parser one needs to create a minimum of four functions.
+
+
+
+
+
+
+
+ START = start_function
+
+
+
+ Initialize the parser. Arguments are a pointer to the parsed text and its
+ length.
+
+ Returns a pointer to the internal structure of a parser. Note that it should
+ be malloc>ed or palloc>ed in the
+ TopMemoryContext>. We name it ParserState>.
+
+
+
+
+
+
+
+ GETTOKEN = gettoken_function
+
+
+
+ Returns the next token.
+ Arguments are ParserState *, char **, int *.
+
+ This procedure will be called repeatedly until it returns token type zero.
+
+
+
+
+
+
+
+ END = end_function,
+
+
+
+ This void function will be called after parsing is finished to free
+ the resources allocated by the parser (ParserState>). The argument
+ is ParserState *.
+
+
+
+
+
+
+
+ LEXTYPES = lextypes_function
+
+
+
+ Returns an array containing the id, alias, and the description of the tokens
+ in the parser. See LexDescr in src/include/utils/ts_public.h>.
+
+
+
+
+
+
+ Below is the source code of our test parser, organized as a contrib> module.
+
+
+ Testing:
+
SELECT * FROM ts_parse('testparser','That''s my first own parser');
tokid | token
3 | own
12 |
3 | parser
+
SELECT to_tsvector('testcfg','That''s my first own parser');
to_tsvector
-------------------------------------------------
'my':2 'own':4 'first':3 'parser':5 'that''s':1
+
SELECT ts_headline('testcfg','Supernovae stars are the brightest phenomena in galaxies', to_tsquery('testcfg', 'star'));
headline
-----------------------------------------------------------------
Supernovae <b>stars</b> are the brightest phenomena in galaxies
-
+
-This test parser is an example adopted from a tutorial by Valli,
-url="http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/HOWTO-parser-tsearch2.html">parser
-HOWTO.
-
+ This test parser is an example adapted from a tutorial by Valli,
+ url="http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/HOWTO-parser-tsearch2.html">parser
+ HOWTO.
+
+
+ To compile the example just do:
-To compile the example just do:
-make
-make install
-psql regression < test_parser.sql
+$ make
+$ make install
+$ psql regression < test_parser.sql
-
+
+
+ This is a test_parser.c>:
-This is a test_parser.c>:
#ifdef PG_MODULE_MAGIC
/* go to the next white-space character */
while ((pst->buffer)[pst->pos] != ' ' &&
pst->pos < pst->len)
- (pst->pos)++;
+ (pst->pos)++;
}
*tlen = pst->pos - *tlen;
PG_RETURN_INT32(type);
}
+
Datum testprs_end(PG_FUNCTION_ARGS)
{
ParserState *pst = (ParserState *) PG_GETARG_POINTER(0);
-This is a Makefile
+ This is a Makefile:
override CPPFLAGS := -I. $(CPPFLAGS)
endif
-This is a test_parser.sql.in:
+ This is a test_parser.sql.in:
SET default_text_search_config = 'english';
BEGIN;
CREATE FUNCTION testprs_start(internal,int4)
-RETURNS internal
-AS 'MODULE_PATHNAME'
-LANGUAGE 'C' with (isstrict);
+ RETURNS internal
+ AS 'MODULE_PATHNAME'
+ LANGUAGE 'C' with (isstrict);
CREATE FUNCTION testprs_getlexeme(internal,internal,internal)
-RETURNS internal
-AS 'MODULE_PATHNAME'
-LANGUAGE 'C' with (isstrict);
+ RETURNS internal
+ AS 'MODULE_PATHNAME'
+ LANGUAGE 'C' with (isstrict);
CREATE FUNCTION testprs_end(internal)
-RETURNS void
-AS 'MODULE_PATHNAME'
-LANGUAGE 'C' with (isstrict);
+ RETURNS void
+ AS 'MODULE_PATHNAME'
+ LANGUAGE 'C' with (isstrict);
CREATE FUNCTION testprs_lextype(internal)
-RETURNS internal
-AS 'MODULE_PATHNAME'
-LANGUAGE 'C' with (isstrict);
+ RETURNS internal
+ AS 'MODULE_PATHNAME'
+ LANGUAGE 'C' with (isstrict);
CREATE TEXT SEARCH PARSER testparser (
- START = testprs_start,
- GETTOKEN = testprs_getlexeme,
- END = testprs_end,
- LEXTYPES = testprs_lextype
-;
+ START = testprs_start,
+ GETTOKEN = testprs_getlexeme,
+ END = testprs_end,
+ LEXTYPES = testprs_lextype
+);
-CREATE TEXT SEARCH CONFIGURATION testcfg ( PARSER = testparser );
+CREATE TEXT SEARCH CONFIGURATION testcfg (PARSER = testparser);
ALTER TEXT SEARCH CONFIGURATION testcfg ADD MAPPING FOR word WITH simple;
END;
-
+
-
+