+ linkend="textsearch-debugging">) is helpful for testing dictionaries.
+
+
Configuration Example
- A full text configuration specifies all options necessary to transform a
+ A text search configuration specifies all options necessary to transform a
document into a tsvector: the parser breaks text into tokens,
and the dictionaries transform each token into a lexeme. Every call to
to_tsvector() and to_tsquery()
needs a configuration to perform its processing. To facilitate management
- of
- full text searching objects, a set of
- SQL commands
- is available, and there are several psql commands which display information
- about full text searching objects ().
+ of
+ text search objects, a set of
+ SQL commands
+ is available, and there are several psql commands that display information
+ about text search objects ().
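+
+ For example, to_tsvector can be given the configuration explicitly as
+ its first argument (english is one of the predefined configurations
+ mentioned below):
+
+SELECT to_tsvector('english', 'The quick brown foxes');
+          to_tsvector
+-------------------------------
+ 'brown':3 'fox':4 'quick':2
+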
- Several predefined text searching configurations are available in the
+ Several predefined text search configurations are available in the
pg_catalog schema. If you need a custom configuration
- you can create a new text searching configuration and modify it using SQL
+ you can create a new text search configuration and modify it using SQL
commands.
- New text searching objects are created in the current schema by default
+ New text search objects are created in the current schema by default
(usually the public schema), but a schema-qualified
name can be used to create objects in the specified schema.
CREATE TEXT SEARCH DICTIONARY pg_dict (
- TEMPLATE = synonym
+ TEMPLATE = synonym,
SYNONYMS = pg_dict
);
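+
+ Before wiring the dictionary into a configuration, it can be checked in
+ isolation with ts_lexize (a sketch: the output assumes the
+ pg_dict synonym file maps postgres and pgsql to pg):
+
+SELECT ts_lexize('pg_dict', 'postgres');
+ ts_lexize
+-----------
+ {pg}
+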
+COMMIT;
+
Now, we can test our configuration:
SELECT * FROM ts_debug('public.pg', '
PostgreSQL, the highly scalable, SQL compliant, open source object-relational
database management system, is now undergoing beta testing of the next
-version of our software: PostgreSQL 8.3.
+version of our software.
');
-
- COMMIT;
-
-
-
- With the dictionaries and mappings set up, suppose we have a table
- pgweb which contains 11239 documents from the
-
- PostgreSQL web site. Only relevant columns
- are shown:
-
-=> \d pgweb
- Table "public.pgweb"
- Column | Type | Modifiers
------------+-------------------+-----------
- tid | integer | not null
- path | character varying | not null
- body | character varying |
- title | character varying |
- dlm | date |
- There are two kinds of indexes which can be used to speed up full text
+ There are two kinds of indexes that can be used to speed up full text
operators ().
Note that indexes are not mandatory for full text searching.
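+
+ For example, either kind of index can be built over a
+ tsvector expression (a minimal sketch; pgweb and
+ body are placeholder table and column names):
+
+CREATE INDEX pgweb_idx_gin ON pgweb USING gin(to_tsvector('english', body));
+CREATE INDEX pgweb_idx_gist ON pgweb USING gist(to_tsvector('english', body));
+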
Actually, this is not the whole story. GiST indexes have an optimization
- for storing small tsvectors (< TOAST_INDEX_TARGET
- bytes, 512 bytes). On leaf pages small tsvectors are stored unchanged,
+ for storing small tsvectors (under TOAST_INDEX_TARGET
+ bytes, 512 bytes by default). On leaf pages small tsvectors are stored unchanged,
while longer ones are represented by their signatures, which introduces
some lossiness. Unfortunately, the existing index API does not allow for
a return value to say whether it found an exact value (tsvector) or whether
not accessed. However, label information is not stored in the index,
so if the query involves label weights it must access
the heap. Therefore, a special full text search operator @@@
- was created which forces the use of the heap to get information about
+ was created that forces the use of the heap to get information about
- labels. GiST indexes are lossy so it always reads the heap and there is
+ labels. GiST indexes are lossy, so a search always reads the heap and there is
no need for a special operator. In the example below,
fulltext_idx is a GIN index:
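+
+ A sketch of such a query (apod, title, and
+ textsearch are assumed table and column names, with
+ fulltext_idx built on the textsearch column):
+
+SELECT title FROM apod WHERE textsearch @@@ to_tsquery('supernovae:a');
+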
- Another example — the
- PostgreSQL mailing list
- archives contained 910,989 unique words with 57,491,343 lexemes in 461,020
- messages.
-
-
-
-
- Information about full text searching objects can be obtained
+ Information about text search objects can be obtained
in
psql using a set of commands:
\dF{d,p,t}+ PATTERN
The optional parameter PATTERN should be the name of
- a text searching object, optionally schema-qualified. If
- PATTERN is not specified then information about all
+ a text search object, optionally schema-qualified. If
+ PATTERN is omitted then information about all
visible objects will be displayed. PATTERN can be a
regular expression and can provide separate patterns
for the schema and object names. The following examples illustrate this:
fulltext | fulltext_cfg |
public | fulltext_cfg |
+
+ The available commands are:
- \dF[+] [PATTERN]
+ \dF+ PATTERN
- List text searching configurations (add +> for more detail).
+ List text search configurations (add + for more detail).
- \dFd[+] [PATTERN]
+ \dFd+ PATTERN
List text search dictionaries (add + for more detail).
- \dFp[+] [PATTERN]
+ \dFp+ PATTERN
List text search parsers (add + for more detail).
- \dFt[+] [PATTERN]
+ \dFt+ PATTERN
List text search templates (add + for more detail).
+
+
Limitations
+
+ The current limitations of
+ PostgreSQL's
+ text search features are:
+
+
+
The length of each lexeme must be less than 2K bytes
+
+
+
The length of a tsvector (lexemes + positions) must be less than 1 megabyte
+
+
+
The number of lexemes must be less than 2^64
+
+
+
Positional information must be greater than 0 and no more than 16,383 (see the sketch after this list)
+
+
+
No more than 256 positions per lexeme
+
+
+
The number of nodes (lexemes + operations) in tsquery must be less than 32,768
+
+
+
+
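+
+ The position limits are applied silently rather than by raising an
+ error; a quick sketch (the exact clamping behavior is an assumption
+ worth verifying on your version):
+
+-- only the first 256 positions of a repeated lexeme are kept
+SELECT to_tsvector('simple', repeat('na ', 300));
+
+-- position values above 16,383 are clamped to 16,383
+SELECT to_tsvector('simple', repeat('filler ', 20000) || 'needle');
+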
+ For comparison, the
+ PostgreSQL 8.1 documentation
+ contained 10,441 unique words, a total of 335,420 words, and the most frequent
+ word postgresql was mentioned 6,127 times in 655 documents.
+
+
+
+ Another example — the
+ PostgreSQL mailing list
+ archives contained 910,989 unique words with 57,491,343 lexemes in 461,020
+ messages.
+
+
+
+
Debugging
- Function ts_debug allows easy testing of your full text searching
- configuration.
+ The function ts_debug allows easy testing of a
+ text search configuration.
- ts_debug(config_name, document TEXT) returns SETOF ts_debug
+ ts_debug(config_name, document text) returns SETOF ts_debug
- <replaceable class="PARAMETER">ts_debug>'s result type is defined as:
+ <function>ts_debug</>'s result type is defined as:
CREATE TYPE ts_debug AS (
For a demonstration of how function ts_debug works we
first create a public.english configuration and
- ispell dictionary for the English language. You can skip the test step and
- play with the standard english configuration.
+ ispell dictionary for the English language:
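+
+ A sketch of that setup (assuming English ispell
+ dictionary files have been installed under these names):
+
+CREATE TEXT SEARCH CONFIGURATION public.english ( COPY = pg_catalog.english );
+
+CREATE TEXT SEARCH DICTIONARY english_ispell (
+    TEMPLATE = ispell,
+    DictFile = english,
+    AffFile = english,
+    StopWords = english
+);
+
+ALTER TEXT SEARCH CONFIGURATION public.english
+    ALTER MAPPING FOR lword WITH english_ispell, english_stem;
+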
SELECT * FROM ts_debug('public.english','The Brightest supernovaes');
- Alias | Description | Token | Dictionaries | Lexized token
--------+---------------+-------------+---------------------------------------+---------------------------------
+ Alias | Description | Token | Dictionaries | Lexized token
+-------+---------------+-------------+-------------------------------------------------+-------------------------------------
lword | Latin word | The | {public.english_ispell,pg_catalog.english_stem} | public.english_ispell: {}
- blank | Space symbols | | |
+ blank | Space symbols | | |
lword | Latin word | Brightest | {public.english_ispell,pg_catalog.english_stem} | public.english_ispell: {bright}
- blank | Space symbols | | |
+ blank | Space symbols | | |
lword | Latin word | supernovaes | {public.english_ispell,pg_catalog.english_stem} | pg_catalog.english_stem: {supernova}
(5 rows)
- In this example, the word Brightest> was recognized by a
- parser as a Latin word (alias lword)
- and came through the dictionaries public.english_ispell> and
- pg_catalog.english_stem. It was recognized by
+ In this example, the word Brightest was recognized by the
+ parser as a Latin word (alias lword).
+ For this token type the dictionary stack is
+ public.english_ispell and
+ pg_catalog.english_stem. The word was recognized by
- public.english_ispell, which reduced it to the noun
+ public.english_ispell, which reduced it to the normalized form
bright. The word supernovaes is unknown
- by the public.english_ispell dictionary so it was passed to
+ to the public.english_ispell dictionary so it was passed to
the next dictionary, and, fortunately, was recognized (in fact,
public.english_stem is a stemming dictionary and recognizes
everything; that is why it was placed at the end of the dictionary stack).
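+
+ A single dictionary can also be probed directly with
+ ts_lexize; this sketch just repeats the lookups from the
+ output above:
+
+SELECT ts_lexize('public.english_ispell', 'Brightest');
+ ts_lexize
+-----------
+ {bright}
+
+SELECT ts_lexize('pg_catalog.english_stem', 'supernovaes');
+  ts_lexize
+-------------
+ {supernova}
+
+ To see only the columns of interest in ts_debug output, list them
+ explicitly:
+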
SELECT "Alias", "Token", "Lexized token"
FROM ts_debug('public.english','The Brightest supernovaes');
Alias | Token | Lexized token
--------+-------------+---------------------------------
+-------+-------------+--------------------------------------
lword | The | public.english_ispell: {}
blank | |
lword | Brightest | public.english_ispell: {bright}
-
-
Example of Creating a Rule-Based Dictionary
-
- The motivation for this example dictionary is to control the indexing of
- integers (signed and unsigned), and, consequently, to minimize the number
- of unique words which greatly affects to performance of searching.
-
-
- The dictionary accepts two options:
-
-
-
- The MAXLEN parameter specifies the maximum length of the
- number considered as a 'good' integer. The default value is 6.
-
-
-
-
- The REJECTLONG parameter specifies if a 'long' integer
- should be indexed or treated as a stop word. If
- REJECTLONG=FALSE (default),
- the dictionary returns the prefixed part of the integer with length
- MAXLEN. If
- REJECTLONG=TRUE, the dictionary
- considers a long integer as a stop word.
-
-
-
-
-
-
-
- A similar idea can be applied to the indexing of decimal numbers, for
- example, in the DecDict dictionary. The dictionary
- accepts two options: the MAXLENFRAC parameter specifies
- the maximum length of the fractional part considered as a 'good' decimal.
- The default value is 3. The REJECTLONG parameter
- controls whether a decimal number with a 'long' fractional part should be indexed
- or treated as a stop word. If
- REJECTLONG=FALSE (default),
- the dictionary returns the decimal number with the length of its fraction part
- truncated to MAXLEN. If
- REJECTLONG=TRUE, the dictionary
- considers the number as a stop word. Notice that
- REJECTLONG=FALSE allows the indexing
- of 'shortened' numbers and search results will contain documents with
- shortened numbers.
-
-
- Examples:
-
-SELECT ts_lexize('intdict', 11234567890);
- ts_lexize
------------
- {112345}
-
-
-
- Now, we want to ignore long integers:
-
-
-ALTER TEXT SEARCH DICTIONARY intdict (
- MAXLEN = 6, REJECTLONG = TRUE
-);
-
-SELECT ts_lexize('intdict', 11234567890);
- ts_lexize
------------
- {}
-
-
-
- Create contrib/dict_intdict> directory with files
- dict_tmpl.c>, Makefile>, dict_intdict.sql.in>:
-
-$ make && make install
-$ psql DBNAME < dict_intdict.sql
-
-
-
- This is a dict_tmpl.c> file:
-
-
-#include "postgres.h"
-#include "utils/builtins.h"
-#include "fmgr.h"
-
-#ifdef PG_MODULE_MAGIC
-PG_MODULE_MAGIC;
-#endif
-
-#include "tsearch/ts_locale.h"
-#include "tsearch/ts_public.h"
-#include "tsearch/ts_utils.h"
-
-typedef struct {
- int maxlen;
- bool rejectlong;
-} DictInt;
-
-
-PG_FUNCTION_INFO_V1(dinit_intdict);
-Datum dinit_intdict(PG_FUNCTION_ARGS);
-
-Datum
-dinit_intdict(PG_FUNCTION_ARGS) {
- DictInt *d = (DictInt*)malloc( sizeof(DictInt) );
- Map *cfg, *pcfg;
- text *in;
-
- if (!d)
- elog(ERROR, "No memory");
- memset(d, 0, sizeof(DictInt));
-
- /* Your INIT code */
- /* defaults */
- d->maxlen = 6;
- d->rejectlong = false;
-
- if (PG_ARGISNULL(0) || PG_GETARG_POINTER(0) == NULL) /* no options */
- PG_RETURN_POINTER(d);
-
- in = PG_GETARG_TEXT_P(0);
- parse_keyvalpairs(in, &cfg);
- PG_FREE_IF_COPY(in, 0);
- pcfg=cfg;
-
- while (pcfg->key)
- {
- if (strcasecmp("MAXLEN", pcfg->key) == 0)
- d->maxlen=atoi(pcfg->value);
- else if ( strcasecmp("REJECTLONG", pcfg->key) == 0)
- {
- if ( strcasecmp("true", pcfg->value) == 0 )
- d->rejectlong=true;
- else if ( strcasecmp("false", pcfg->value) == 0)
- d->rejectlong=false;
- else
- elog(ERROR,"Unknown value: %s => %s", pcfg->key, pcfg->value);
- }
- else
- elog(ERROR,"Unknown option: %s => %s", pcfg->key, pcfg->value);
-
- pfree(pcfg->key);
- pfree(pcfg->value);
- pcfg++;
- }
- pfree(cfg);
-
- PG_RETURN_POINTER(d);
- }
-
-PG_FUNCTION_INFO_V1(dlexize_intdict);
-Datum dlexize_intdict(PG_FUNCTION_ARGS);
-Datum
-dlexize_intdict(PG_FUNCTION_ARGS)
-{
- DictInt *d = (DictInt*)PG_GETARG_POINTER(0);
- char *in = (char*)PG_GETARG_POINTER(1);
- char *txt = pnstrdup(in, PG_GETARG_INT32(2));
- TSLexeme *res = palloc(sizeof(TSLexeme) * 2);
-
- /* Your INIT dictionary code */
- res[1].lexeme = NULL;
-
- if (PG_GETARG_INT32(2) > d->maxlen)
- {
- if (d->rejectlong)
- { /* stop, return void array */
- pfree(txt);
- res[0].lexeme = NULL;
- }
- else
- { /* cut integer */
- txt[d->maxlen] = '\0';
- res[0].lexeme = txt;
- }
- }
- else
- res[0].lexeme = txt;
-
- PG_RETURN_POINTER(res);
-}
-
-
- This is the Makefile:
-
-subdir = contrib/dict_intdict
-top_builddir = ../..
-include $(top_builddir)/src/Makefile.global
-
-MODULE_big = dict_intdict
-OBJS = dict_tmpl.o
-DATA_built = dict_intdict.sql
-DOCS =
-
-include $(top_srcdir)/contrib/contrib-global.mk
-
-
-
- This is a dict_intdict.sql.in:
-
-SET default_text_search_config = 'english';
-
-BEGIN;
-
-CREATE OR REPLACE FUNCTION dinit_intdict(internal)
- RETURNS internal
- AS 'MODULE_PATHNAME'
- LANGUAGE 'C';
-
-CREATE OR REPLACE FUNCTION dlexize_intdict(internal,internal,internal,internal)
- RETURNS internal
- AS 'MODULE_PATHNAME'
- LANGUAGE 'C'
- WITH (isstrict);
-
-CREATE TEXT SEARCH TEMPLATE intdict_template (
- LEXIZE = dlexize_intdict, INIT = dinit_intdict
-);
-
-CREATE TEXT SEARCH DICTIONARY intdict (
- TEMPLATE = intdict_template,
- MAXLEN = 6, REJECTLONG = false
-);
-
-COMMENT ON TEXT SEARCH DICTIONARY intdict IS 'Dictionary for Integers';
-
-END;
-
-
-
-
-
-
-
Example of Creating a Parser
-
-
SQL command
CREATE TEXT SEARCH PARSER creates
- a parser for full text searching. In our example we will implement
- a simple parser which recognizes space-delimited words and
- has only two types (3, word, Word; 12, blank, Space symbols). Identifiers
- were chosen to keep compatibility with the default headline() function
- since we do not implement our own version.
-
-
- To implement a parser one needs to create a minimum of four functions.
-
-
-
-
-
-
-
- START = start_function
-
-
-
- Initialize the parser. Arguments are a pointer to the parsed text and its
- length.
-
- Returns a pointer to the internal structure of a parser. Note that it should
- be malloc>ed or palloc>ed in the
- TopMemoryContext>. We name it ParserState>.
-
-
-
-
-
-
-
- GETTOKEN = gettoken_function
-
-
-
- Returns the next token.
- Arguments are ParserState *, char **, int *.
-
- This procedure will be called as long as the procedure returns token type zero.
-
-
-
-
-
-
-
- END = end_function,
-
-
-
- This void function will be called after parsing is finished to free
- allocated resources in this procedure (ParserState>). The argument
- is ParserState *.
-
-
-
-
-
-
-
- LEXTYPES = lextypes_function
-
-
-
- Returns an array containing the id, alias, and the description of the tokens
- in the parser. See LexDescr in src/include/utils/ts_public.h>.
-
-
-
-
-
-
- Below is the source code of our test parser, organized as a contrib> module.
-
-
- Testing:
-
-SELECT * FROM ts_parse('testparser','That''s my first own parser');
- tokid | token
--------+--------
- 3 | That's
- 12 |
- 3 | my
- 12 |
- 3 | first
- 12 |
- 3 | own
- 12 |
- 3 | parser
-
-SELECT to_tsvector('testcfg','That''s my first own parser');
- to_tsvector
--------------------------------------------------
- 'my':2 'own':4 'first':3 'parser':5 'that''s':1
-
-SELECT ts_headline('testcfg','Supernovae stars are the brightest phenomena in galaxies', to_tsquery('testcfg', 'star'));
- headline
------------------------------------------------------------------
- Supernovae <b>stars</b> are the brightest phenomena in galaxies
-
-
-
-
- This test parser is an example adopted from a tutorial by Valli,
- url="http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/HOWTO-parser-tsearch2.html">parser
- HOWTO.
-
-
- To compile the example just do:
-
-$ make
-$ make install
-$ psql regression < test_parser.sql
-
-
-
- This is a test_parser.c>:
-
-
-#ifdef PG_MODULE_MAGIC
-PG_MODULE_MAGIC;
-#endif
-
-/*
- * types
- */
-
-/* self-defined type */
-typedef struct {
- char * buffer; /* text to parse */
- int len; /* length of the text in buffer */
- int pos; /* position of the parser */
-} ParserState;
-
-/* copy-paste from wparser.h of tsearch2 */
-typedef struct {
- int lexid;
- char *alias;
- char *descr;
-} LexDescr;
-
-/*
- * prototypes
- */
-PG_FUNCTION_INFO_V1(testprs_start);
-Datum testprs_start(PG_FUNCTION_ARGS);
-
-PG_FUNCTION_INFO_V1(testprs_getlexeme);
-Datum testprs_getlexeme(PG_FUNCTION_ARGS);
-
-PG_FUNCTION_INFO_V1(testprs_end);
-Datum testprs_end(PG_FUNCTION_ARGS);
-
-PG_FUNCTION_INFO_V1(testprs_lextype);
-Datum testprs_lextype(PG_FUNCTION_ARGS);
-
-/*
- * functions
- */
-Datum testprs_start(PG_FUNCTION_ARGS)
-{
- ParserState *pst = (ParserState *) palloc(sizeof(ParserState));
- pst->buffer = (char *) PG_GETARG_POINTER(0);
- pst->len = PG_GETARG_INT32(1);
- pst->pos = 0;
-
- PG_RETURN_POINTER(pst);
-}
-
-Datum testprs_getlexeme(PG_FUNCTION_ARGS)
-{
- ParserState *pst = (ParserState *) PG_GETARG_POINTER(0);
- char **t = (char **) PG_GETARG_POINTER(1);
- int *tlen = (int *) PG_GETARG_POINTER(2);
- int type;
-
- *tlen = pst->pos;
- *t = pst->buffer + pst->pos;
-
- if ((pst->buffer)[pst->pos] == ' ')
- {
- /* blank type */
- type = 12;
- /* go to the next non-white-space character */
- while ((pst->buffer)[pst->pos] == ' ' &&
- pst->pos < pst->len)
- (pst->pos)++;
- } else {
- /* word type */
- type = 3;
- /* go to the next white-space character */
- while ((pst->buffer)[pst->pos] != ' ' &&
- pst->pos < pst->len)
- (pst->pos)++;
- }
-
- *tlen = pst->pos - *tlen;
-
- /* we are finished if (*tlen == 0) */
- if (*tlen == 0)
- type=0;
-
- PG_RETURN_INT32(type);
-}
-
-Datum testprs_end(PG_FUNCTION_ARGS)
-{
- ParserState *pst = (ParserState *) PG_GETARG_POINTER(0);
- pfree(pst);
- PG_RETURN_VOID();
-}
-
-Datum testprs_lextype(PG_FUNCTION_ARGS)
-{
- /*
- Remarks:
- - we have to return the blanks for headline reason
- - we use the same lexids like Teodor in the default
- word parser; in this way we can reuse the headline
- function of the default word parser.
- */
- LexDescr *descr = (LexDescr *) palloc(sizeof(LexDescr) * (2+1));
-
- /* there are only two types in this parser */
- descr[0].lexid = 3;
- descr[0].alias = pstrdup("word");
- descr[0].descr = pstrdup("Word");
- descr[1].lexid = 12;
- descr[1].alias = pstrdup("blank");
- descr[1].descr = pstrdup("Space symbols");
- descr[2].lexid = 0;
-
- PG_RETURN_POINTER(descr);
-}
-
-
-
- This is a Makefile
-
-override CPPFLAGS := -I. $(CPPFLAGS)
-
-MODULE_big = test_parser
-OBJS = test_parser.o
-
-DATA_built = test_parser.sql
-DATA =
-DOCS = README.test_parser
-REGRESS = test_parser
-
-
-ifdef USE_PGXS
-PGXS := $(shell pg_config --pgxs)
-include $(PGXS)
-else
-subdir = contrib/test_parser
-top_builddir = ../..
-include $(top_builddir)/src/Makefile.global
-include $(top_srcdir)/contrib/contrib-global.mk
-endif
-
-
- This is a test_parser.sql.in:
-
-SET default_text_search_config = 'english';
-
-BEGIN;
-
-CREATE FUNCTION testprs_start(internal,int4)
- RETURNS internal
- AS 'MODULE_PATHNAME'
- LANGUAGE 'C' with (isstrict);
-
-CREATE FUNCTION testprs_getlexeme(internal,internal,internal)
- RETURNS internal
- AS 'MODULE_PATHNAME'
- LANGUAGE 'C' with (isstrict);
-
-CREATE FUNCTION testprs_end(internal)
- RETURNS void
- AS 'MODULE_PATHNAME'
- LANGUAGE 'C' with (isstrict);
-
-CREATE FUNCTION testprs_lextype(internal)
- RETURNS internal
- AS 'MODULE_PATHNAME'
- LANGUAGE 'C' with (isstrict);
-
-
-CREATE TEXT SEARCH PARSER testparser (
- START = testprs_start,
- GETTOKEN = testprs_getlexeme,
- END = testprs_end,
- LEXTYPES = testprs_lextype
-);
-
-CREATE TEXT SEARCH CONFIGURATION testcfg (PARSER = testparser);
-ALTER TEXT SEARCH CONFIGURATION testcfg ADD MAPPING FOR word WITH simple;
-
-END;
-
-
-
-
-
-