File: nlp.doc

package info (click to toggle)
swi-prolog 9.0.4%2Bdfsg-2
links: PTS, VCS
area: main
in suites: bookworm
size: 82,408 kB
sloc: ansic: 387,503; perl: 359,326; cpp: 6,613; lisp: 6,247; java: 5,540; sh: 3,147; javascript: 2,668; python: 1,900; ruby: 1,594; yacc: 845; makefile: 428; xml: 317; sed: 12; sql: 6
file content (145 lines) | stat: -rw-r--r-- 4,860 bytes
parent folder | download | duplicates (4)
\documentclass[11pt]{article}
\usepackage{times}
\usepackage{pl}
\usepackage{html}
\sloppy
\makeindex

\onefile
\htmloutput{.}					% Output directory
\htmlmainfile{nlp}				% Main document file
\bodycolor{white}				% Page colour

\begin{document}

\title{SWI-Prolog Natural Language Processing Primitives}
\author{Jan Wielemaker \\
	VU University Amsterdam \\
	The Netherlands \\
	E-mail: \email{J.Wielemaker@cs.vu.nl}}

\maketitle

\begin{abstract}
This package contains some well known basic routines for natural
language processing and information retrieval.  The current version
of this package is very limited, which makes the name somewhat
misleading.  Suggestions and contributions are welcome.
\end{abstract}

\pagebreak
\tableofcontents

\vfill
\vfill

\newpage

\section{Double Metaphone -- Phonetic string matching}
\label{sec:double-metaphone}

The library \pllib{double_metaphone} implements the \emph{Double
Metaphone} algorithm developed by Lawrence Philips and described in
``The Double-Metaphone Search Algorithm'' by L Philips, C/C++ User's
Journal, 2000. Double Metaphone creates a key from a word that
represents its phonetic properties. Two words with the same Double
Metaphone are supposed to sound similar. The Double Metaphone algorithm
is an improved version of the \emph{Soundex} algorithm.

\begin{description}
    \predicate{double_metaphone}{2}{+In, -MetaPhone}
Same as double_metaphone/3, but only returning the primary metaphone.
    \predicate{double_metaphone}{3}{+In, -MetaPhone, -AltMetaphone}
Create metaphone and alternative metaphone from \arg{In}.  The primary
metaphone is based on english, while the secondary deals with common
alternative pronounciation in other languages.  \arg{In} is either
and atom, string object, code- or character list.  The metaphones
are always returned as atoms.
\end{description}


\subsection{Origin and Copyright}
\label{sec:double-metaphone-copyright}

The Double Metaphone algorithm is copied from the Perl library that
holds the following copyright notice.  To the best of our knowledge
the Perl license is compatible to the SWI-Prolog license schema and
therefore including this module poses no additional license conditions.

\begin{quote}
Copyright 2000, Maurice Aubrey <maurice@hevanet.com>.
All rights reserved.

This code is based heavily on the C++ implementation by Lawrence Philips
and incorporates several bug fixes courtesy of Kevin Atkinson
<kevina@users.sourceforge.net>.

This module is free software; you may redistribute it and/or modify it
under the same terms as Perl itself.
\end{quote}

\section{Porter Stem -- Determine stem and related routines}
\label{sec:porter-stem}

The \pllib{porter_stem} library implements the stemming algorithm
described by Porter in Porter, 1980, ``An algorithm for suffix
stripping'', Program, Vol. 14, no. 3, pp 130-137. The library comes with
some additional predicates that are commonly used in the context of
stemming.

\begin{description}
    \predicate{porter_stem}{2}{+In, -Stem}
Determine the stem of \arg{In}. \arg{In} must represent ISO Latin-1
text. The porter_stem/2 predicate first maps \arg{In} to lower case,
then removes all accents as in unaccent_atom/2 and finally applies the
Porter stem algorithm.

    \predicate{unaccent_atom}{2}{+In, -ASCII}
If \arg{In} is general ISO Latin-1 text with accents, \arg{ASCII} is
unified with a plain ASCII version of the string.  Note that the current
version only deals with ISO Latin-1 atoms.

    \predicate{tokenize_atom}{2}{+In, -TokenList}
Break the text \arg{In} into words, numbers and punctuation characters.
Tokens are created to the following rules:

\begin{tabular}{ll}
\verb$[-+][0-9]+(\.[0-9]+)?([eE][-+][0-9]+)?$	& number \\
\verb$[:alpha:][:alnum:]+$			& word \\
\verb$[:space:]+$				& skipped \\
anything else					& single-character
\end{tabular}

Character classification is based on the C-library iswalnum() etc.\
functions. Recognised numbers are passed to Prolog read/1, supporting
unbounded integers.

It is likely that future versions of this library will provide
tokenize_atom/3 with additional options to modify space handling as
well as the definition of words.

    \predicate{atom_to_stem_list}{2}{+In, -ListOfStems}
Combines the three above routines, returning a list holding an atom
with the stem of each word encountered and numbers for encountered
numbers.
\end{description}


\subsection{Origin and Copyright}
\label{sec:porter-stem-copyright}

The code is based on the original Public Domain implementation by Martin
Porter as can be found at
\url{http://www.tartarus.org/martin/PorterStemmer/}. The code has been
modified by Jan Wielemaker. He removed all global variables to make the
code thread-safe, added the unaccent and tokenize code and created the
SWI-Prolog binding.

\input{snowball.tex}

\input{isub.tex}

\printindex

\end{document}