+url="http://snowball.tartarus.org">Snowball site for more
+information). Full text searching contains a large number of stemmers for
+many languages. The only option that is accepted by a snowball stemmer is the
+location of a file with stop words. It can be defined using the
+ALTER TEXT SEARCH DICTIONARY command.
+
+ALTER TEXT SEARCH DICTIONARY en_stem
+ SET OPTION 'StopFile=english-utf8.stop, Language=english';
+
+
+
+Relative paths given in OPTION are resolved relative to
+share/dicts_data:
+ALTER TEXT SEARCH DICTIONARY en_stem SET OPTION 'english.stop';
+
+
+
+The Snowball dictionary recognizes everything, so it is best
+to place it at the end of the dictionary stack. It is useless to place it
+before any other dictionary, because a lexeme will never pass through its
+stemmer to a later dictionary.
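+For example, in the mapping created later in this chapter the stemmer is
+listed last:
+ALTER TEXT SEARCH CONFIGURATION pg ALTER MAPPING FOR lword
+    WITH pg_dict, en_ispell, en_stem;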
+
+
+
+
+
+
Dictionary Testing
+
+The lexize function facilitates dictionary testing:
+
+
+
+
+
+
+
+
+
+lexize(dict_name text, lexeme text) returns text[]
+
+
+
+
+Returns an array of lexemes if the input lexeme
+is known to the dictionary dict_name, an empty
+array if the lexeme is known to the dictionary but is a stop word, or
+NULL if it is an unknown word.
+
+SELECT lexize('en_stem', 'stars');
+ lexize
+--------
+ {star}
+SELECT lexize('en_stem', 'a');
+ lexize
+--------
+ {}
+
+
+
+
+
+
+
+
+The lexize function expects a single
+lexeme, not arbitrary text. Below is an example:
+SELECT lexize('thesaurus_astro','supernovae stars') is null;
+ ?column?
+----------
+ t
+
+The thesaurus dictionary thesaurus_astro does know the
+phrase supernovae stars, but lexize fails because it does not
+parse the input text; it treats it as a single lexeme. Use
+plainto_tsquery and to_tsvector to test thesaurus
+dictionaries:
+SELECT plainto_tsquery('supernovae stars');
+ plainto_tsquery
+-----------------
+ 'sn'
+
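+Similarly, to_tsvector shows the thesaurus substitution (a sketch, assuming
+the same configuration has thesaurus_astro in its mappings; exact positions
+may vary):
+SELECT to_tsvector('supernovae stars');
+ to_tsvector
+-------------
+ 'sn':1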
+
+
+
+
+
+
+
Configuration Example
+
+A full text configuration specifies all options necessary to transform a
+document into a tsvector: the parser breaks text into tokens,
+and the dictionaries transform each token into a lexeme. Every call to
+to_tsvector() and to_tsquery()
+needs a configuration to perform its processing. To facilitate management
+of full text searching objects, a set of SQL commands
+is available, and there are several psql commands that display information
+about full text searching objects.
+
+
+The GUC variable default_text_search_config
+(optionally schema-qualified) defines the name of the currently
+active configuration. It can be set in
+postgresql.conf or with the SET command.
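+For example, to make the built-in english configuration the default,
+postgresql.conf could contain:
+default_text_search_config = 'pg_catalog.english'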
+
+
+Predefined full text searching objects are available in the
+pg_catalog schema. If you need a custom configuration
+you can create a new full text searching object and modify it using SQL
+commands.
+
+New full text searching objects are created in the current schema by default
+(usually the public schema), but a schema-qualified
+name can be used to create objects in a specified schema. An object is owned
+by the current user; ownership can be changed using the ALTER TEXT
+SEARCH OWNER command.
+
+
+As an example, we will create a configuration
+pg which starts as a duplicate of the
+english configuration. To be safe, we do this in a transaction:
+BEGIN;
+
+CREATE TEXT SEARCH CONFIGURATION public.pg LIKE english WITH MAP;
+
+
+
+We will use a PostgreSQL-specific synonym dictionary
+and store it in the share/dicts_data directory. The
+dictionary file looks like this:
+postgres pg
+pgsql pg
+postgresql pg
+
+
+CREATE TEXT SEARCH DICTIONARY pg_dict
+ TEMPLATE synonym
+ OPTION 'pg_dict.txt';
+
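+The new dictionary can be checked immediately with the lexize
+function described earlier; each synonym should map to pg:
+SELECT lexize('pg_dict', 'postgresql');
+ lexize
+--------
+ {pg}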
+
+
+
+Then register the ispell dictionary en_ispell using
+the ispell_template template:
+
+CREATE TEXT SEARCH DICTIONARY en_ispell
+ TEMPLATE ispell_template
+ OPTION 'DictFile="english-utf8.dict",
+ AffFile="english-utf8.aff",
+ StopFile="english-utf8.stop"';
+
+
+
+Use the same stop word list for the
+Snowball stemmer en_stem,
+which is available by default:
+
+ALTER TEXT SEARCH DICTIONARY en_stem SET OPTION 'english-utf8.stop';
+
+
+
+Modify the mappings for Latin words for configuration pg:
+
+ALTER TEXT SEARCH CONFIGURATION pg ALTER MAPPING FOR lword, lhword, lpart_hword
+ WITH pg_dict, en_ispell, en_stem;
+
+
+
+We choose not to index or search some token types:
+
+ALTER TEXT SEARCH CONFIGURATION pg DROP MAPPING FOR email, url, sfloat, uri, float;
+
+
+
+Now, we can test our configuration:
+SELECT * FROM ts_debug('public.pg', '
+PostgreSQL, the highly scalable, SQL compliant, open source object-relational
+database management system, is now undergoing beta testing of the next
+version of our software: PostgreSQL 8.2.
+');
+
+COMMIT;
+
+
+
+With the dictionaries and mappings set up, suppose we have a table
+pgweb which contains 11239 documents from the
+PostgreSQL web site. Only relevant columns
+are shown:
+=> \d pgweb
+ Table "public.pgweb"
+ Column | Type | Modifiers
+-----------+-------------------+-----------
+ tid | integer | not null
+ path | character varying | not null
+ body | character varying |
+ title | character varying |
+ dlm | integer |
+
+
+
+The next step is to set the session to use the new configuration, which was
+created in the public schema:
+=> \dF public.*
+List of fulltext configurations
+ Schema | Name | Description
+--------+------+-------------
+ public | pg   |
+
+SET default_text_search_config = 'public.pg';
+SET
+
+SHOW default_text_search_config;
+ default_text_search_config
+----------------------------
+ public.pg
+
+
+
+
+
+
+
Managing Multiple Configurations
+
+If you are using the same text search configuration for the entire cluster,
+just set the value in postgresql.conf. If you are using a single
+text search configuration for an entire database, use ALTER
+DATABASE ... SET.
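+For example, to make our pg configuration the default for a single
+database (mydb is a hypothetical database name):
+ALTER DATABASE mydb SET default_text_search_config = 'public.pg';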
+
+
+However, if you need to use several text search configurations in the same
+database you must be careful to reference the proper text search
+configuration. This can be done by either setting
+default_text_search_config in each session or supplying the
+configuration name in every function call, e.g. to_tsquery('pg',
+'friend'), to_tsvector('pg', col). If you are using an expression index,
+you must also be sure to use the proper text search configuration every
+time an INSERT or UPDATE is executed, because these
+will modify the index; alternatively, you can embed the configuration name into the
+expression index, e.g.:
+CREATE INDEX pgweb_idx ON pgweb USING gin(to_tsvector('pg', textcat(title, body)));
+
+If you do that, make sure you specify the configuration name in the
+WHERE clause as well so the expression index will be used.
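+For instance, a search that can use the expression index above might look
+like this (a sketch, reusing the pgweb table from the previous section):
+SELECT title
+FROM pgweb
+WHERE to_tsvector('pg', textcat(title, body)) @@ to_tsquery('pg', 'friend');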
+
+
+
+
+
+
+
+
GiST and GIN Index Types
+
+
+ full text
+
+
+
+There are two kinds of indexes which can be used to speed up full text
+search operators.
+Note that indexes are not mandatory for full text searching.
+
+
+
+
+
+
+
+GIST
+
+
+
+
+CREATE INDEX name ON table USING gist(column);
+
+
+
+
+Creates a GiST (Generalized Search Tree)-based index.
+
+
+
+
+
+
+
+
+GIN
+
+
+
+
+CREATE INDEX name ON table USING gin(column);
+
+
+
+
+Creates a GIN (Generalized Inverted Index)-based index.
+column is a
+TSVECTOR, TEXT,
+VARCHAR, or CHAR-type column.
+
+
+
+
+
+
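+For example, either kind of index can be built on the apod
+table used in the examples below, assuming its textsearch
+column is of type TSVECTOR:
+CREATE INDEX textsearch_gidx ON apod USING gist(textsearch);
+CREATE INDEX textsearch_idx ON apod USING gin(textsearch);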
+
+
+A GiST index is lossy, meaning it is necessary
+to consult the heap to check for false hits.
+PostgreSQL does this automatically; note the
+Filter: line in the example below:
+EXPLAIN SELECT * FROM apod WHERE textsearch @@ to_tsquery('supernovae');
+ QUERY PLAN
+-------------------------------------------------------------------------
+ Index Scan using textsearch_gidx on apod (cost=0.00..12.29 rows=2 width=1469)
+ Index Cond: (textsearch @@ '''supernova'''::tsquery)
+ Filter: (textsearch @@ '''supernova'''::tsquery)
+
+GiST index lossiness happens because each document is represented by a
+fixed-length signature. The signature is generated by hashing (crc32) each
+word into a random bit in an n-bit string and all words combine to produce
+an n-bit document signature. Because of hashing there is a chance that
+some words hash to the same position and could result in a false hit.
+Signatures calculated for each document in a collection are stored in an
+RD-tree (Russian Doll tree), invented by Hellerstein,
+which is an adaptation of R-tree for sets. In our case
+the transitive containment relation is realized by
+superimposed coding (Knuth, 1973) of signatures, i.e., a parent is the
+result of 'OR'-ing the bit-strings of all children. This is a second
+factor of lossiness. It is clear that parents tend to be full of
+'1's (degenerate) and become quite useless because of their
+limited selectivity. Searching is performed as a bit comparison of a
+signature representing the query and an RD-tree entry.
+If all '1's of both signatures are in the same position we
+say that this branch probably matches the query, but if there is even one
+discrepancy we can definitely reject this branch.
+
+
+Lossiness causes serious performance degradation, since random access to
+heap records is slow; this limits the usefulness of GiST
+indexes. The likelihood of false hits depends on several factors, such as
+the number of unique words, so using dictionaries to reduce this number
+is recommended.
+
+
+Actually, this is not the whole story. GiST indexes have an optimization
+for storing small tsvectors (less than TOAST_INDEX_TARGET
+bytes, currently 512 bytes). On leaf pages small tsvectors are stored unchanged,
+while longer ones are represented by their signatures, which introduces
+some lossiness. Unfortunately, the existing index API does not allow for
+a return value to say whether it found an exact value (tsvector) or whether
+the result needs to be checked. This is why the GiST index is
+currently marked as lossy. We hope to improve this in the future.
+
+
+GIN indexes are not lossy but their performance depends logarithmically on
+the number of unique words.
+
+
+There is one side-effect of the non-lossiness of a GIN index when using
+query labels/weights, like 'supernovae:a'. A GIN index
+has all the information necessary to determine a match, so the heap is
+not accessed. However, label information is not stored in the index, so
+if the query has label information the heap must be accessed.
+Therefore, a special full text search operator @@@
+was created which forces the use of the heap to get information about
+labels. GiST indexes are lossy so the heap is always read and there is
+no need for a special operator. In the example below,
+textsearch_idx is a GIN index:
+EXPLAIN SELECT * FROM apod WHERE textsearch @@@ to_tsquery('supernovae:a');
+ QUERY PLAN
+------------------------------------------------------------------------
+ Index Scan using textsearch_idx on apod (cost=0.00..12.30 rows=2 width=1469)
+ Index Cond: (textsearch @@@ '''supernova'':A'::tsquery)
+ Filter: (textsearch @@@ '''supernova'':A'::tsquery)
+
+
+
+
+In choosing which index type to use, GiST or GIN, consider these differences:
+
+GIN index lookups are about three times faster than GiST
+
+GIN indexes take about three times longer to build than GiST
+
+GIN indexes are about ten times slower to update than GiST
+
+GIN indexes are two-to-three times larger than GiST
+
+
+
+
+In summary, GIN indexes are best for static data because
+lookups are faster. For dynamic data, GiST indexes are
+faster to update. Specifically, GiST indexes are very
+good for dynamic data and fast if the number of unique words (lexemes) is
+under 100,000, while GIN handles 100,000+ lexemes better
+but is slower to update.
+
+
+Partitioning of big collections and the proper use of GiST and GIN indexes
+allows the implementation of very fast searches with online update.
+Partitioning can be done at the database level using table inheritance
+and constraint_exclusion, or by distributing documents over
+servers and collecting search results using the contrib/dblink
+extension module. The latter is possible because ranking functions use
+only local information.
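+A minimal sketch of the inheritance approach (all table, column, and
+constraint names here are hypothetical):
+-- parent table and one partition holding a year of documents
+CREATE TABLE messages (posted date, body text, textsearch tsvector);
+CREATE TABLE messages_2006 (
+    CHECK (posted >= '2006-01-01' AND posted < '2007-01-01')
+) INHERITS (messages);
+CREATE INDEX messages_2006_idx ON messages_2006 USING gin(textsearch);
+
+-- with constraint exclusion enabled, a query qualified by "posted"
+-- scans only the partitions whose CHECK constraints can match
+SET constraint_exclusion = on;
+SELECT * FROM messages
+WHERE textsearch @@ to_tsquery('supernovae') AND posted >= '2006-06-01';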
+
+
+
+
+
+
Limitations
+
+The current limitations of Full Text Searching are:
+
+
The length of each lexeme must be less than 2K bytes
+
The length of a tsvector (lexemes + positions) must be less than 1 megabyte
+
The number of lexemes must be less than 2^64
+
Positional information must be non-negative and less than 16,383
+
No more than 256 positions per lexeme
+
The number of nodes (lexemes + operators) in a tsquery must be less than 32,768
+
+
+
+For comparison, the PostgreSQL 8.1 documentation
+consists of 10,441 unique words, a total of 335,420 words, and the most frequent word
+'postgresql' is mentioned 6,127 times in 655 documents.
+
+
+Another example: the PostgreSQL mailing list archives
+contain 910,989 unique words with 57,491,343 lexemes in 461,020 messages.
+
+
+
+
+
+
+Information about full text searching objects can be obtained
+in psql using a set of commands:
+
+\dF{,d,p}+ PATTERN
+
+An optional + produces more details.
+
+The optional parameter PATTERN should be the name of
+a full text searching object, optionally schema-qualified. If
+PATTERN is not specified then information about all
+visible objects will be displayed. PATTERN can be a
+regular expression, applied separately to schema
+names and object names. The following examples illustrate this:
+=> \dF *fulltext*
+ List of fulltext configurations
+ Schema | Name | Description
+--------+--------------+-------------
+ public | fulltext_cfg |
+
+
+=> \dF *.fulltext*
+ List of fulltext configurations
+ Schema | Name | Description
+----------+--------------+-------------
+ fulltext | fulltext_cfg |
+ public   | fulltext_cfg |
+
+
+
+
+
+
+\dF[+] [PATTERN]
+
+
+ List full text searching configurations (add "+" for more detail)
+
+ By default (without PATTERN), information about
+ all visible full text configurations will be
+ displayed.
+
+=> \dF russian
+ List of fulltext configurations
+ Schema | Name | Description
+------------+---------+-----------------------------------
+ pg_catalog | russian | default configuration for Russian
+
+=> \dF+ russian
+Configuration "pg_catalog.russian"
+Parser name: "pg_catalog.default"
+Locale: 'ru_RU.UTF-8' (default)
+ Token | Dictionaries
+--------------+-------------------------
+ email | pg_catalog.simple
+ file | pg_catalog.simple
+ float | pg_catalog.simple
+ host | pg_catalog.simple
+ hword | pg_catalog.ru_stem_utf8
+ int | pg_catalog.simple
+ lhword | public.tz_simple
+ lpart_hword | public.tz_simple
+ lword | public.tz_simple
+ nlhword | pg_catalog.ru_stem_utf8
+ nlpart_hword | pg_catalog.ru_stem_utf8
+ nlword | pg_catalog.ru_stem_utf8
+ part_hword | pg_catalog.simple
+ sfloat | pg_catalog.simple
+ uint | pg_catalog.simple
+ uri | pg_catalog.simple
+ url | pg_catalog.simple
+ version | pg_catalog.simple
+ word | pg_catalog.ru_stem_utf8
+
+
+
+
+
+
+\dFd[+] [PATTERN]
+
+ List full text dictionaries (add "+" for more detail).
+
+ By default (without PATTERN), information about
+ all visible dictionaries will be displayed.
+
+=> \dFd
+ List of fulltext dictionaries
+ Schema | Name | Description
+------------+------------+-----------------------------------------------------------
+ pg_catalog | danish | Snowball stemmer for danish language
+ pg_catalog | dutch | Snowball stemmer for dutch language
+ pg_catalog | english | Snowball stemmer for english language
+ pg_catalog | finnish | Snowball stemmer for finnish language
+ pg_catalog | french | Snowball stemmer for french language
+ pg_catalog | german | Snowball stemmer for german language
+ pg_catalog | hungarian | Snowball stemmer for hungarian language
+ pg_catalog | italian | Snowball stemmer for italian language
+ pg_catalog | norwegian | Snowball stemmer for norwegian language
+ pg_catalog | portuguese | Snowball stemmer for portuguese language
+ pg_catalog | romanian | Snowball stemmer for romanian language
+ pg_catalog | russian | Snowball stemmer for russian language
+ pg_catalog | simple | simple dictionary: just lower case and check for stopword
+ pg_catalog | spanish | Snowball stemmer for spanish language
+ pg_catalog | swedish | Snowball stemmer for swedish language
+ pg_catalog | turkish | Snowball stemmer for turkish language
+
+
+
+
+
+
+
+\dFp[+] [PATTERN]
+
+ List full text parsers (add "+" for more detail)
+
+ By default (without PATTERN), information about
+ all visible full text parsers will be displayed.
+
+=> \dFp
+ List of fulltext parsers
+ Schema | Name | Description
+------------+---------+---------------------
+ pg_catalog | default | default word parser
+(1 row)
+=> \dFp+
+ Fulltext parser "pg_catalog.default"
+ Method | Function | Description
+-------------------+---------------------------+-------------
+ Start parse | pg_catalog.prsd_start |
+ Get next token | pg_catalog.prsd_nexttoken |
+ End parse | pg_catalog.prsd_end |
+ Get headline | pg_catalog.prsd_headline |
+ Get lexeme's type | pg_catalog.prsd_lextype |
+
+ Token's types for parser "pg_catalog.default"
+ Token name | Description
+--------------+-----------------------------------
+ blank | Space symbols
+ email | Email
+ entity | HTML Entity
+ file | File or path name
+ float | Decimal notation
+ host | Host
+ hword | Hyphenated word
+ int | Signed integer
+ lhword | Latin hyphenated word
+ lpart_hword | Latin part of hyphenated word
+ lword | Latin word
+ nlhword | Non-latin hyphenated word
+ nlpart_hword | Non-latin part of hyphenated word
+ nlword | Non-latin word
+ part_hword | Part of hyphenated word
+ protocol | Protocol head
+ sfloat | Scientific notation
+ tag | HTML Tag
+ uint | Unsigned integer
+ uri | URI
+ url | URL
+ version | VERSION
+ word | Word
+(23 rows)
+
+
+
+
+
+
+
+
+
+
+
+
Debugging
+
+The ts_debug function allows easy testing of a full text searching
+configuration.
+
+
+
+ts_debug(conf_name text, document text) returns SETOF tsdebug
+
+
+ts_debug displays information about every token of
+document as produced by the
+parser and processed by the configured dictionaries, using the configuration
+specified by conf_name.
+
+The tsdebug type is defined as:
+CREATE TYPE tsdebug AS (
+ "Alias" text,
+ "Description" text,
+ "Token" text,
+ "Dicts list" text[],
+ "Lexized token" text
+
+
+
+For a demonstration of how the ts_debug function works, we
+first create a public.english configuration and an
+ispell dictionary for the English language. You can skip these steps and
+experiment with the standard english configuration instead.
+
+CREATE TEXT SEARCH CONFIGURATION public.english LIKE pg_catalog.english WITH MAP AS DEFAULT;
+CREATE TEXT SEARCH DICTIONARY en_ispell
+ TEMPLATE ispell_template
+ OPTION 'DictFile="/usr/local/share/dicts/ispell/english-utf8.dict",
+ AffFile="/usr/local/share/dicts/ispell/english-utf8.aff",
+ StopFile="/usr/local/share/dicts/english.stop"';
+ALTER TEXT SEARCH MAPPING ON public.english FOR lword WITH en_ispell,en_stem;
+
+
+SELECT * FROM ts_debug('public.english','The Brightest supernovaes');
+ Alias | Description | Token | Dicts list | Lexized token
+-------+---------------+-------------+---------------------------------------+---------------------------------
+ lword | Latin word | The | {public.en_ispell,pg_catalog.en_stem} | public.en_ispell: {}
+ blank | Space symbols | | |
+ lword | Latin word | Brightest | {public.en_ispell,pg_catalog.en_stem} | public.en_ispell: {bright}
+ blank | Space symbols | | |
+ lword | Latin word | supernovaes | {public.en_ispell,pg_catalog.en_stem} | pg_catalog.en_stem: {supernova}
+(5 rows)
+
+In this example, the word Brightest was recognized by the
+parser as a Latin word (alias lword)
+and came through the dictionaries public.en_ispell and
+pg_catalog.en_stem. It was recognized by
+public.en_ispell, which reduced it to the word
+bright. The word supernovaes is unknown
+to the public.en_ispell dictionary so it was passed to
+the next dictionary, and, fortunately, was recognized (in fact,
+public.en_stem is a stemming dictionary and recognizes
+everything; that is why it was placed at the end of the dictionary stack).
+
+
+The word The was recognized by the public.en_ispell
+dictionary as a stop word and will not be indexed.
+
+
+You can always explicitly specify which columns you want to see:
+SELECT "Alias", "Token", "Lexized token"
+FROM ts_debug('public.english','The Brightest supernovaes');
+ Alias | Token | Lexized token
+-------+-------------+---------------------------------
+ lword | The | public.en_ispell: {}
+ blank | |
+ lword | Brightest | public.en_ispell: {bright}
+ blank | |
+ lword | supernovaes | pg_catalog.en_stem: {supernova}
+(5 rows)
+
+
+
+
+
+
+
Example of Creating a Rule-Based Dictionary
+
+The motivation for this example dictionary is to control the indexing of
+integers (signed and unsigned), and, consequently, to minimize the number
+of unique words, which greatly affects the performance of searching.
+
+
+The dictionary accepts two options:
+
+
+The MAXLEN parameter specifies the maximum length of the
+number considered as a 'good' integer. The default value is 6.
+
+
+The REJECTLONG parameter specifies whether a 'long' integer
+should be indexed or treated as a stop word. If
+REJECTLONG=FALSE (default),
+the dictionary returns the first MAXLEN digits of the
+integer. If
+REJECTLONG=TRUE, the dictionary
+considers a long integer as a stop word.
+
+
+
+
+
+
+A similar idea can be applied to the indexing of decimal numbers, for
+example, in the DecDict dictionary. The dictionary
+accepts two options: the MAXLENFRAC parameter specifies
+the maximum length of the fractional part considered as a 'good' decimal.
+The default value is 3. The REJECTLONG parameter
+controls whether a decimal number with a 'long' fractional part should be indexed
+or treated as a stop word. If
+REJECTLONG=FALSE (default),
+the dictionary returns the decimal number with its fractional part
+truncated to MAXLENFRAC digits. If
+REJECTLONG=TRUE, the dictionary
+considers the number as a stop word. Notice that
+REJECTLONG=FALSE allows the indexing
+of 'shortened' numbers and search results will contain documents with
+shortened numbers.
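+If such a dictionary were installed under the (hypothetical) name
+decdict, it would behave like this with the default settings:
+SELECT lexize('decdict', '3.14159');
+ lexize
+---------
+ {3.141}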
+
+
+
+Examples:
+SELECT lexize('intdict', '11234567890');
+ lexize
+----------
+ {112345}
+
+
+Now, we want to ignore long integers:
+
+ALTER TEXT SEARCH DICTIONARY intdict SET OPTION 'MAXLEN=6, REJECTLONG=TRUE';
+SELECT lexize('intdict', '11234567890');
+ lexize
+--------
+ {}
+
+
+
+Create a contrib/dict_intdict directory with the files
+dict_tmpl.c, Makefile, and dict_intdict.sql.in, then build and install the module:
+make && make install
+psql DBNAME < dict_intdict.sql
+
+
+
+This is the dict_tmpl.c file:
+
+
+#include "postgres.h"
+#include "utils/builtins.h"
+#include "fmgr.h"
+
+#ifdef PG_MODULE_MAGIC
+PG_MODULE_MAGIC;
+#endif
+
+#include "utils/ts_locale.h"
+#include "utils/ts_public.h"
+#include "utils/ts_utils.h"
+
+ typedef struct {
+ int maxlen;
+ bool rejectlong;
+ } DictInt;
+
+
+ PG_FUNCTION_INFO_V1(dinit_intdict);
+ Datum dinit_intdict(PG_FUNCTION_ARGS);
+
+ Datum
+ dinit_intdict(PG_FUNCTION_ARGS) {
+ DictInt *d = (DictInt*)malloc( sizeof(DictInt) );
+ Map *cfg, *pcfg;
+ text *in;
+
+ if (!d)
+ elog(ERROR, "No memory");
+ memset(d, 0, sizeof(DictInt));
+
+ /* Your INIT code */
+/* defaults */
+ d->maxlen = 6;
+ d->rejectlong = false;
+
+ if ( PG_ARGISNULL(0) || PG_GETARG_POINTER(0) == NULL ) /* no options */
+ PG_RETURN_POINTER(d);
+
+ in = PG_GETARG_TEXT_P(0);
+ parse_keyvalpairs(in, &cfg);
+ PG_FREE_IF_COPY(in, 0);
+ pcfg=cfg;
+
+ while (pcfg->key)
+ {
+ if (strcasecmp("MAXLEN", pcfg->key) == 0)
+ d->maxlen=atoi(pcfg->value);
+ else if ( strcasecmp("REJECTLONG", pcfg->key) == 0)
+ {
+ if ( strcasecmp("true", pcfg->value) == 0 )
+ d->rejectlong=true;
+ else if ( strcasecmp("false", pcfg->value) == 0)
+ d->rejectlong=false;
+ else
+ elog(ERROR,"Unknown value: %s => %s", pcfg->key, pcfg->value);
+ }
+ else
+ elog(ERROR,"Unknown option: %s => %s", pcfg->key, pcfg->value);
+
+ pfree(pcfg->key);
+ pfree(pcfg->value);
+ pcfg++;
+ }
+ pfree(cfg);
+
+ PG_RETURN_POINTER(d);
+ }
+
+PG_FUNCTION_INFO_V1(dlexize_intdict);
+Datum dlexize_intdict(PG_FUNCTION_ARGS);
+Datum
+dlexize_intdict(PG_FUNCTION_ARGS)
+{
+ DictInt *d = (DictInt*)PG_GETARG_POINTER(0);
+ char *in = (char*)PG_GETARG_POINTER(1);
+ char *txt = pnstrdup(in, PG_GETARG_INT32(2));
+ TSLexeme *res = palloc(sizeof(TSLexeme) * 2);
+
+ /* Your lexize code goes here */
+ res[1].lexeme = NULL;
+
+ if (PG_GETARG_INT32(2) > d->maxlen)
+ {
+ if (d->rejectlong)
+ { /* stop, return void array */
+ pfree(txt);
+ res[0].lexeme = NULL;
+ }
+ else
+ { /* cut integer */
+ txt[d->maxlen] = '\0';
+ res[0].lexeme = txt;
+ }
+ }
+ else
+ res[0].lexeme = txt;
+
+ PG_RETURN_POINTER(res);
+}
+
+
+This is the Makefile:
+subdir = contrib/dict_intdict
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+
+MODULE_big = dict_intdict
+OBJS = dict_tmpl.o
+DATA_built = dict_intdict.sql
+DOCS =
+
+include $(top_srcdir)/contrib/contrib-global.mk
+
+
+
+This is the dict_intdict.sql.in file:
+SET default_text_search_config = 'english';
+
+BEGIN;
+
+CREATE OR REPLACE FUNCTION dinit_intdict(internal)
+RETURNS internal
+AS 'MODULE_PATHNAME'
+LANGUAGE 'C';
+
+CREATE OR REPLACE FUNCTION dlexize_intdict(internal,internal,internal,internal)
+RETURNS internal
+AS 'MODULE_PATHNAME'
+LANGUAGE 'C'
+WITH (isstrict);
+
+CREATE TEXT SEARCH DICTIONARY intdict
+ LEXIZE 'dlexize_intdict' INIT 'dinit_intdict'
+ OPTION 'MAXLEN=6,REJECTLONG = false';
+
+COMMENT ON TEXT SEARCH DICTIONARY intdict IS 'Dictionary for Integers';
+
+END;
+
+
+
+
+
+
+
Example of Creating a Parser
+
+
+The SQL command CREATE TEXT SEARCH PARSER creates
+a parser for full text searching. In our example we will implement
+a simple parser which recognizes space-delimited words and
+has only two token types: (3, word, Word) and (12, blank, Space symbols). These identifiers
+were chosen to keep compatibility with the default headline() function,
+since we do not implement our own version.
+
+
+To implement a parser one needs to create a minimum of four functions.
+
+
+
+
+
+
+
+START = start_function
+
+
+
+Initialize the parser. Arguments are a pointer to the parsed text and its
+length.
+
+Returns a pointer to the internal structure of the parser. Note that it should
+be malloced or palloced in the
+TopMemoryContext. We name it ParserState.
+
+
+
+
+
+
+
+GETTOKEN = gettoken_function
+
+
+
+Returns the next token.
+Arguments are ParserState *, char **, int *.
+
+This procedure will be called repeatedly until it returns a token type of zero.
+
+
+
+
+
+
+
+END = end_function,
+
+
+
+This void function will be called after parsing is finished. It should free
+the resources allocated by the parser (the ParserState). The argument
+is ParserState *.
+
+
+
+
+
+
+
+LEXTYPES = lextypes_function
+
+
+
+Returns an array containing the id, alias, and description of the token types
+used by the parser. See LexDescr in src/include/utils/ts_public.h.
+
+
+
+
+
+
+Below is the source code of our test parser, organized as a contrib module.
+
+
+Testing:
+SELECT * FROM parse('testparser','That''s my first own parser');
+ tokid | token
+-------+--------
+ 3 | That's
+ 12 |
+ 3 | my
+ 12 |
+ 3 | first
+ 12 |
+ 3 | own
+ 12 |
+ 3 | parser
+SELECT to_tsvector('testcfg','That''s my first own parser');
+ to_tsvector
+-------------------------------------------------
+ 'my':2 'own':4 'first':3 'parser':5 'that''s':1
+SELECT headline('testcfg','Supernovae stars are the brightest phenomena in galaxies', to_tsquery('testcfg', 'star'));
+ headline
+-----------------------------------------------------------------
+ Supernovae <b>stars</b> are the brightest phenomena in galaxies
+
+
+
+
+This test parser is an example adapted from a tutorial by Valli: the parser
+HOWTO at
+http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/HOWTO-parser-tsearch2.html.
+
+
+To compile the example just do:
+make
+make install
+psql regression < test_parser.sql
+
+
+
+This is the test_parser.c file:
+
+#include "postgres.h"
+#include "fmgr.h"
+
+#ifdef PG_MODULE_MAGIC
+PG_MODULE_MAGIC;
+#endif
+
+/*
+ * types
+ */
+
+/* self-defined type */
+typedef struct {
+ char * buffer; /* text to parse */
+ int len; /* length of the text in buffer */
+ int pos; /* position of the parser */
+} ParserState;
+
+/* copy-paste from wparser.h of tsearch2 */
+typedef struct {
+ int lexid;
+ char *alias;
+ char *descr;
+} LexDescr;
+
+/*
+ * prototypes
+ */
+PG_FUNCTION_INFO_V1(testprs_start);
+Datum testprs_start(PG_FUNCTION_ARGS);
+
+PG_FUNCTION_INFO_V1(testprs_getlexeme);
+Datum testprs_getlexeme(PG_FUNCTION_ARGS);
+
+PG_FUNCTION_INFO_V1(testprs_end);
+Datum testprs_end(PG_FUNCTION_ARGS);
+
+PG_FUNCTION_INFO_V1(testprs_lextype);
+Datum testprs_lextype(PG_FUNCTION_ARGS);
+
+/*
+ * functions
+ */
+Datum testprs_start(PG_FUNCTION_ARGS)
+{
+ ParserState *pst = (ParserState *) palloc(sizeof(ParserState));
+ pst->buffer = (char *) PG_GETARG_POINTER(0);
+ pst->len = PG_GETARG_INT32(1);
+ pst->pos = 0;
+
+ PG_RETURN_POINTER(pst);
+}
+
+Datum testprs_getlexeme(PG_FUNCTION_ARGS)
+{
+ ParserState *pst = (ParserState *) PG_GETARG_POINTER(0);
+ char **t = (char **) PG_GETARG_POINTER(1);
+ int *tlen = (int *) PG_GETARG_POINTER(2);
+ int type;
+
+ *tlen = pst->pos;
+ *t = pst->buffer + pst->pos;
+
+ if (pst->pos < pst->len && (pst->buffer)[pst->pos] == ' ')
+ {
+ /* blank type */
+ type = 12;
+ /* skip to the next non-white-space character; check bounds first */
+ while (pst->pos < pst->len &&
+ (pst->buffer)[pst->pos] == ' ')
+ (pst->pos)++;
+ } else {
+ /* word type */
+ type = 3;
+ /* skip to the next white-space character; check bounds first */
+ while (pst->pos < pst->len &&
+ (pst->buffer)[pst->pos] != ' ')
+ (pst->pos)++;
+ }
+
+ *tlen = pst->pos - *tlen;
+
+ /* we are finished if (*tlen == 0) */
+ if (*tlen == 0)
+ type=0;
+
+ PG_RETURN_INT32(type);
+}
+Datum testprs_end(PG_FUNCTION_ARGS)
+{
+ ParserState *pst = (ParserState *) PG_GETARG_POINTER(0);
+ pfree(pst);
+ PG_RETURN_VOID();
+}
+
+Datum testprs_lextype(PG_FUNCTION_ARGS)
+{
+ /*
+ Remarks:
+ - we have to return the blanks for headline reason
+ - we use the same lexids like Teodor in the default
+ word parser; in this way we can reuse the headline
+ function of the default word parser.
+ */
+ LexDescr *descr = (LexDescr *) palloc(sizeof(LexDescr) * (2+1));
+
+ /* there are only two types in this parser */
+ descr[0].lexid = 3;
+ descr[0].alias = pstrdup("word");
+ descr[0].descr = pstrdup("Word");
+ descr[1].lexid = 12;
+ descr[1].alias = pstrdup("blank");
+ descr[1].descr = pstrdup("Space symbols");
+ descr[2].lexid = 0;
+
+ PG_RETURN_POINTER(descr);
+}
+
+
+
+This is the Makefile:
+
+override CPPFLAGS := -I. $(CPPFLAGS)
+
+MODULE_big = test_parser
+OBJS = test_parser.o
+
+DATA_built = test_parser.sql
+DATA =
+DOCS = README.test_parser
+REGRESS = test_parser
+
+
+ifdef USE_PGXS
+PGXS := $(shell pg_config --pgxs)
+include $(PGXS)
+else
+subdir = contrib/test_parser
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
+
+
+This is the test_parser.sql.in file:
+
+SET default_text_search_config = 'english';
+
+BEGIN;
+
+CREATE FUNCTION testprs_start(internal,int4)
+RETURNS internal
+AS 'MODULE_PATHNAME'
+LANGUAGE 'C' with (isstrict);
+
+CREATE FUNCTION testprs_getlexeme(internal,internal,internal)
+RETURNS internal
+AS 'MODULE_PATHNAME'
+LANGUAGE 'C' with (isstrict);
+
+CREATE FUNCTION testprs_end(internal)
+RETURNS void
+AS 'MODULE_PATHNAME'
+LANGUAGE 'C' with (isstrict);
+
+CREATE FUNCTION testprs_lextype(internal)
+RETURNS internal
+AS 'MODULE_PATHNAME'
+LANGUAGE 'C' with (isstrict);
+
+
+CREATE TEXT SEARCH PARSER testparser
+ START 'testprs_start'
+ GETTOKEN 'testprs_getlexeme'
+ END 'testprs_end'
+ LEXTYPES 'testprs_lextype'
+;
+
+CREATE TEXT SEARCH CONFIGURATION testcfg PARSER 'testparser';
+ALTER TEXT SEARCH CONFIGURATION testcfg ADD MAPPING FOR word WITH simple;
+
+END;
+
+
+
+
+
+
+