MySQL supports server-side full-text parser plugins only with
MyISAM. For introductory
information about full-text parser plugins, see
Full-Text Parser Plugins.
A full-text parser plugin can be used to replace or modify the
built-in full-text parser. This section describes how to write
a full-text parser plugin named
simple_parser. This plugin performs parsing
based on simpler rules than those used by the MySQL built-in
full-text parser: Words are nonempty runs of whitespace
characters.
The instructions use the source code in the
plugin/fulltext directory of MySQL source
distributions, so change location into that directory. The
following procedure describes how the plugin library is
created:
To write a full-text parser plugin, include the following header file in the plugin source file. Other MySQL or general header files might also be needed, depending on the plugin capabilities and requirements.
#include <mysql/plugin.h>
plugin.h defines the
MYSQL_FTPARSER_PLUGIN server plugin
type and the data structures needed to declare the plugin.
Set up the library descriptor for the plugin library file.
This descriptor contains the general plugin descriptor for
the server plugin. For a full-text parser plugin, the type
must be MYSQL_FTPARSER_PLUGIN. This is
the value that identifies the plugin as being legal for
use in a WITH PARSER clause when
creating a FULLTEXT index. (No other
plugin type is legal for this clause.)
For example, the library descriptor for a library that
contains a single full-text parser plugin named
simple_parser looks like this:
mysql_declare_plugin(ftexample)
{
MYSQL_FTPARSER_PLUGIN, /* type */
&simple_parser_descriptor, /* descriptor */
"simple_parser", /* name */
"Oracle Corporation", /* author */
"Simple Full-Text Parser", /* description */
PLUGIN_LICENSE_GPL, /* plugin license */
simple_parser_plugin_init, /* init function (when loaded) */
simple_parser_plugin_deinit,/* deinit function (when unloaded) */
0x0001, /* version */
simple_status, /* status variables */
simple_system_variables, /* system variables */
NULL,
0
}
mysql_declare_plugin_end;
The name member
(simple_parser) indicates the name to
use for references to the plugin in statements such as
INSTALL PLUGIN or
UNINSTALL PLUGIN. This is
also the name displayed by SHOW
PLUGINS or
INFORMATION_SCHEMA.PLUGINS.
For more information, see Section 24.2.4.2.1, “Server Plugin Library and Plugin Descriptors”.
Set up the type-specific plugin descriptor.
Each general plugin descriptor in the library descriptor
points to a type-specific descriptor. For a full-text
parser plugin, the type-specific descriptor is an instance
of the st_mysql_ftparser structure in
the plugin.h file:
struct st_mysql_ftparser
{
int interface_version;
int (*parse)(MYSQL_FTPARSER_PARAM *param);
int (*init)(MYSQL_FTPARSER_PARAM *param);
int (*deinit)(MYSQL_FTPARSER_PARAM *param);
};
As shown by the structure definition, the descriptor has an interface version number and contains pointers to three functions.
The interface version number is specified using a symbol,
which is in the form:
MYSQL_.
For full-text parser plugins, the symbol is
xxx_INTERFACE_VERSIONMYSQL_FTPARSER_INTERFACE_VERSION. In
the source code, you will find the actual interface
version number for the full-text parser plugin defined in
include/mysql/plugin_ftparser.h.
The init and deinit
members should point to a function or be set to 0 if the
function is not needed. The parse
member must point to the function that performs the
parsing.
In the simple_parser declaration, that
descriptor is indicated by
&simple_parser_descriptor. The
descriptor specifies the version number for the full-text
plugin interface (as given by
MYSQL_FTPARSER_INTERFACE_VERSION), and
the plugin's parsing, initialization, and deinitialization
functions:
static struct st_mysql_ftparser simple_parser_descriptor=
{
MYSQL_FTPARSER_INTERFACE_VERSION, /* interface version */
simple_parser_parse, /* parsing function */
simple_parser_init, /* parser init function */
simple_parser_deinit /* parser deinit function */
};
A full-text parser plugin is used in two different contexts, indexing and searching. In both contexts, the server calls the initialization and deinitialization functions at the beginning and end of processing each SQL statement that causes the plugin to be invoked. However, during statement processing, the server calls the main parsing function in context-specific fashion:
For indexing, the server calls the parser for each column value to be indexed.
For searching, the server calls the parser to parse
the search string. The parser might also be called for
rows processed by the statement. In natural language
mode, there is no need for the server to call the
parser. For boolean mode phrase searches or natural
language searches with query expansion, the parser is
used to parse column values for information that is
not in the index. Also, if a boolean mode search is
done for a column that has no
FULLTEXT index, the built-in parser
will be called. (Plugins are associated with specific
indexes. If there is no index, no plugin is used.)
The plugin declaration in the general plugin descriptor
has init and deinit
members that point initialization and deinitialization
functions, and so does the type-specific plugin descriptor
to which it points. However, these pairs of functions have
different purposes and are invoked for different reasons:
For the plugin declaration in the general plugin descriptor, the initialization and deinitialization functions are invoked when the plugin is loaded and unloaded.
For the type-specific plugin descriptor, the initialization and deinitialization functions are invoked per SQL statement for which the plugin is used.
Each interface function named in the plugin descriptor
should return zero for success or nonzero for failure, and
each of them receives an argument that points to a
MYSQL_FTPARSER_PARAM structure
containing the parsing context. The structure has this
definition:
typedef struct st_mysql_ftparser_param
{
int (*mysql_parse)(struct st_mysql_ftparser_param *,
char *doc, int doc_len);
int (*mysql_add_word)(struct st_mysql_ftparser_param *,
char *word, int word_len,
MYSQL_FTPARSER_BOOLEAN_INFO *boolean_info);
void *ftparser_state;
void *mysql_ftparam;
struct charset_info_st *cs;
char *doc;
int length;
int flags;
enum enum_ftparser_mode mode;
} MYSQL_FTPARSER_PARAM;
The structure members are used as follows:
mysql_parse: A pointer to a
callback function that invokes the server's built-in
parser. Use this callback when the plugin acts as a
front end to the built-in parser. That is, when the
plugin parsing function is called, it should process
the input to extract the text and pass the text to the
mysql_parse callback.
The first parameter for this callback function should
be the param value itself:
param->mysql_parse(param, ...);
A front end plugin can extract text and pass it all at once to the built-in parser, or it can extract and pass text to the built-in parser a piece at a time. However, in this case, the built-in parser treats the pieces of text as though there are implicit word breaks between them.
mysql_add_word: A pointer to a
callback function that adds a word to a full-text
index or to the list of search terms. Use this
callback when the parser plugin replaces the built-in
parser. That is, when the plugin parsing function is
called, it should parse the input into words and
invoke the mysql_add_word callback
for each word.
The first parameter for this callback function should
be the param value itself:
param->mysql_add_word(param, ...);
ftparser_state: This is a generic
pointer. The plugin can set it to point to information
to be used internally for its own purposes.
mysql_ftparam: This is set by the
server. It is passed as the first argument to the
mysql_parse or
mysql_add_word callback.
cs: A pointer to information about
the character set of the text, or 0 if no information
is available.
doc: A pointer to the text to be
parsed.
length: The length of the text to
be parsed, in bytes.
flags: Parser flags. This is zero
if there are no special flags. The only nonzero flag
is MYSQL_FTFLAGS_NEED_COPY, which
means that mysql_add_word() must
save a copy of the word (that is, it cannot use a
pointer to the word because the word is in a buffer
that will be overwritten.) This member was added in
MySQL 5.1.12.
This flag might be set or reset by MySQL before
calling the parser plugin, by the parser plugin
itself, or by the mysql_parse()
function.
mode: The parsing mode. This value
will be one of the following constants:
MYSQL_FTPARSER_SIMPLE_MODE:
Parse in fast and simple mode, which is used for
indexing and for natural language queries. The
parser should pass to the server only those words
that should be indexed. If the parser uses length
limits or a stopword list to determine which words
to ignore, it should not pass such words to the
server.
MYSQL_FTPARSER_WITH_STOPWORDS:
Parse in stopword mode. This is used in boolean
searches for phrase matching. The parser should
pass all words to the server, even stopwords or
words that are outside any normal length limits.
MYSQL_FTPARSER_FULL_BOOLEAN_INFO:
Parse in boolean mode. This is used for parsing
boolean query strings. The parser should recognize
not only words but also boolean-mode operators and
pass them to the server as tokens using the
mysql_add_word callback. To
tell the server what kind of token is being
passed, the plugin needs to fill in a
MYSQL_FTPARSER_BOOLEAN_INFO
structure and pass a pointer to it.
If the parser is called in boolean mode, the
param->mode value will be
MYSQL_FTPARSER_FULL_BOOLEAN_INFO. The
MYSQL_FTPARSER_BOOLEAN_INFO structure
that the parser uses for passing token information to the
server looks like this:
typedef struct st_mysql_ftparser_boolean_info
{
enum enum_ft_token_type type;
int yesno;
int weight_adjust;
char wasign;
char trunc;
/* These are parser state and must be removed. */
char prev;
char *quot;
} MYSQL_FTPARSER_BOOLEAN_INFO;
The parser should fill in the structure members as follows:
type: The token type. The following
table shows the permissible types.
Table 24.3 Full-Text Parser Token Types
| Token Value | Meaning |
|---|---|
FT_TOKEN_EOF | End of data |
FT_TOKEN_WORD | A regular word |
FT_TOKEN_LEFT_PAREN | The beginning of a group or subexpression |
FT_TOKEN_RIGHT_PAREN | The end of a group or subexpression |
FT_TOKEN_STOPWORD | A stopword |
yesno: Whether the word must be
present for a match to occur. 0 means that the word is
optional but increases the match relevance if it is
present. Values larger than 0 mean that the word must
be present. Values smaller than 0 mean that the word
must not be present.
weight_adjust: A weighting factor
that determines how much a match for the word counts.
It can be used to increase or decrease the word's
importance in relevance calculations. A value of zero
indicates no weight adjustment. Values greater than or
less than zero mean higher or lower weight,
respectively. The examples at
Section 12.9.2, “Boolean Full-Text Searches”, that use the
< and >
operators illustrate how weighting works.
wasign: The sign of the weighting
factor. A negative value acts like the
~ boolean-search operator, which
causes the word's contribution to the relevance to be
negative.
trunc: Whether matching should be
done as if the boolean-mode *
truncation operator had been given.
Plugins should not use the prev and
quot members of the
MYSQL_FTPARSER_BOOLEAN_INFO structure.
The plugin parser framework does not support:
The @distance boolean operator.
A leading plus sign (+) or minus
sign (-) boolean operator
followed by a space and then a word ('+
apple' or '- apple').
The leading plus or minus sign must be directly
adjacent to the word, for example:
'+apple' or
'-apple'.
For information about boolean full-text search operators, see Section 12.9.2, “Boolean Full-Text Searches”.
Set up the plugin interface functions.
The general plugin descriptor in the library descriptor
names the initialization and deinitialization functions
that the server should invoke when it loads and unloads
the plugin. For simple_parser, these
functions do nothing but return zero to indicate that they
succeeded:
static int simple_parser_plugin_init(void *arg __attribute__((unused)))
{
return(0);
}
static int simple_parser_plugin_deinit(void *arg __attribute__((unused)))
{
return(0);
}
Because those functions do not actually do anything, you could omit them and specify 0 for each of them in the plugin declaration.
The type-specific plugin descriptor for
simple_parser names the initialization,
deinitialization, and parsing functions that the server
invokes when the plugin is used. For
simple_parser, the initialization and
deinitialization functions do nothing:
static int simple_parser_init(MYSQL_FTPARSER_PARAM *param
__attribute__((unused)))
{
return(0);
}
static int simple_parser_deinit(MYSQL_FTPARSER_PARAM *param
__attribute__((unused)))
{
return(0);
}
Here too, because those functions do nothing, you could omit them and specify 0 for each of them in the plugin descriptor.
The main parsing function,
simple_parser_parse(), acts as a
replacement for the built-in full-text parser, so it needs
to split text into words and pass each word to the server.
The parsing function's first argument is a pointer to a
structure that contains the parsing context. This
structure has a doc member that points
to the text to be parsed, and a length
member that indicates how long the text is. The simple
parsing done by the plugin considers nonempty runs of
whitespace characters to be words, so it identifies words
like this:
static int simple_parser_parse(MYSQL_FTPARSER_PARAM *param)
{
char *end, *start, *docend= param->doc + param->length;
for (end= start= param->doc;; end++)
{
if (end == docend)
{
if (end > start)
add_word(param, start, end - start);
break;
}
else if (isspace(*end))
{
if (end > start)
add_word(param, start, end - start);
start= end + 1;
}
}
return(0);
}
As the parser finds each word, it invokes a function
add_word() to pass the word to the
server. add_word() is a helper function
only; it is not part of the plugin interface. The parser
passes the parsing context pointer to
add_word(), as well as a pointer to the
word and a length value:
static void add_word(MYSQL_FTPARSER_PARAM *param, char *word, size_t len)
{
MYSQL_FTPARSER_BOOLEAN_INFO bool_info=
{ FT_TOKEN_WORD, 0, 0, 0, 0, ' ', 0 };
param->mysql_add_word(param, word, len, &bool_info);
}
For boolean-mode parsing, add_word()
fills in the members of the bool_info
structure as described earlier in the discussion of the
st_mysql_ftparser_boolean_info
structure.
Set up the status variables. For the
simple_parser plugin, the following
status variable array sets up one status variable with a
value that is static text, and another with a value that
is stored in a long integer variable:
long number_of_calls= 0;
struct st_mysql_show_var simple_status[]=
{
{"static", (char *)"just a static text", SHOW_CHAR},
{"called", (char *)&number_of_calls, SHOW_LONG},
{0,0,0}
};
When the plugin is installed, the plugin name and the
name value are joined with an
underscore to form the name displayed by
SHOW STATUS. For the array
just shown, the resulting status variable names are
simple_parser_static and
simple_parser_called. This convention
means that you can easily display the variables for a
plugin using its name:
mysql> SHOW STATUS LIKE 'simple_parser%';
+----------------------+--------------------+
| Variable_name | Value |
+----------------------+--------------------+
| simple_parser_static | just a static text |
| simple_parser_called | 0 |
+----------------------+--------------------+
To compile and install a plugin library file, use the
instructions in
Section 24.2.4.3, “Compiling and Installing Plugin Libraries”. To make the
library file available for use, install it in the plugin
directory (the directory named by the
plugin_dir system
variable). For the simple_parser
plugin, it is compiled and installed when you build MySQL
from source. It is also included in binary distributions.
The build process produces a shared object library with a
name of mypluglib.so (the
.so suffix might differ depending on
your platform).
To use the plugin, register it with the server. For
example, to register the plugin at runtime, use this
statement (adjust the .so suffix for
your platform as necessary):
INSTALL PLUGIN simple_parser SONAME 'mypluglib.so';
For additional information about plugin loading, see Section 5.5.2, “Installing and Uninstalling Plugins”.
To verify plugin installation, examine the
INFORMATION_SCHEMA.PLUGINS
table or use the SHOW
PLUGINS statement. See
Section 5.5.3, “Obtaining Server Plugin Information”.
Test the plugin to verify that it works properly.
Create a table that contains a string column and associate
the parser plugin with a FULLTEXT index
on the column:
mysql>CREATE TABLE t (c VARCHAR(255),->FULLTEXT (c) WITH PARSER simple_parser->) ENGINE=MyISAM;Query OK, 0 rows affected (0.01 sec)
Insert some text into the table and try some searches. These should verify that the parser plugin treats all nonwhitespace characters as word characters:
mysql>INSERT INTO t VALUES->('latin1_general_cs is a case-sensitive collation'),->('I\'d like a case of oranges'),->('this is sensitive information'),->('another row'),->('yet another row');Query OK, 5 rows affected (0.02 sec) Records: 5 Duplicates: 0 Warnings: 0 mysql>SELECT c FROM t;+-------------------------------------------------+ | c | +-------------------------------------------------+ | latin1_general_cs is a case-sensitive collation | | I'd like a case of oranges | | this is sensitive information | | another row | | yet another row | +-------------------------------------------------+ 5 rows in set (0.00 sec) mysql>SELECT MATCH(c) AGAINST('case') FROM t;+--------------------------+ | MATCH(c) AGAINST('case') | +--------------------------+ | 0 | | 1.2968142032623 | | 0 | | 0 | | 0 | +--------------------------+ 5 rows in set (0.00 sec) mysql>SELECT MATCH(c) AGAINST('sensitive') FROM t;+-------------------------------+ | MATCH(c) AGAINST('sensitive') | +-------------------------------+ | 0 | | 0 | | 1.3253291845322 | | 0 | | 0 | +-------------------------------+ 5 rows in set (0.01 sec) mysql>SELECT MATCH(c) AGAINST('case-sensitive') FROM t;+------------------------------------+ | MATCH(c) AGAINST('case-sensitive') | +------------------------------------+ | 1.3109166622162 | | 0 | | 0 | | 0 | | 0 | +------------------------------------+ 5 rows in set (0.01 sec) mysql>SELECT MATCH(c) AGAINST('I\'d') FROM t;+--------------------------+ | MATCH(c) AGAINST('I\'d') | +--------------------------+ | 0 | | 1.2968142032623 | | 0 | | 0 | | 0 | +--------------------------+ 5 rows in set (0.01 sec)
Neither “case” nor “insensitive” match “case-insensitive” the way that they would for the built-in parser.