Pinot
Copyright 2005-2025 Fabrice Colin <fabrice dot colin at gmail dot com>
Homepage - https://github.com/FabriceColin/pinot
previously hosted at http://code.google.com/p/pinot-search/
and http://pinot.berlios.de/
Translations - https://translations.launchpad.net/pinot/trunk/+pots/pinot
1. What is Pinot
2. Available engines
3. Indexes
4. Indexing and monitoring
5. Searching
6. Viewing cached results
7. File formats
8. File patterns
9. Digging deeper
10. Saving results
11. D-Bus service & daemon
12. CJKV support
13. Environment variables and aliases
14. How to reset indexes
15. Compiling
1. What is Pinot
Pinot combines desktop search and metasearch. It consists of :
* a D-Bus service daemon that crawls, indexes and monitors your documents,
and that plugs into the GNOME Shell search system ("pinot-dbus-daemon")
* a GTK3-based user interface that lets you query the index built by
the service as well as Web engines, and which can display and analyze
the results ("pinot")
* other command-line tools
It was developed and tested on GNU/Linux and should work on other Unix-like
systems.
2. Available engines
One of the main functionalities of Pinot is metasearch. This lets you query
a variety of sources, including Web-based search engines. By default, the
list of available engines is hidden and defaults to internal indexes (see
section "3. Indexes"). To show the list of engines, click on the Show All
Search Engines button, next to the Query field immediately below the menu
bar. Click on the same button again to hide the list.
Any number of engines or engine groups may be selected at any one time.
Multi-selection works as in any other application. All queries are always
run against the list of currently selected engines.
Pinot supports both Sherlock and OpenSearch Description plugins. They are
installed in $PREFIX/share/pinot/engines/, where PREFIX is usually /usr.
Additional engines can be installed in that directory or in ~/.pinot/engines.
Note that ~/.pinot/engines is not created automatically.
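Since the per-user engines directory has to be created by hand, installing a
plugin for a single user takes two steps. The following is a minimal sketch;
install_engine is a hypothetical helper name, not something shipped with Pinot.

```shell
# Copy a search engine plugin into the per-user engines directory,
# creating that directory first since Pinot does not create it.
# install_engine is a hypothetical helper, not part of Pinot.
install_engine() {
    plugin="$1"
    # The second argument is optional and defaults to ~/.pinot/engines.
    engines_dir="${2:-$HOME/.pinot/engines}"
    mkdir -p "$engines_dir" && cp "$plugin" "$engines_dir/"
}
```

For instance, "install_engine ./MyEngine.src" would copy a plugin file named
MyEngine.src (a made-up name) to ~/.pinot/engines/.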
Sherlock is what Firefox and the Mozilla Suite use. Chances are that somebody
wrote a plugin for the engine you are interested in. Beware that a lot are
out of date and will require some changes. Use pinot-search on the
command-line to run a quick check on a plugin, eg
$ pinot-search sherlock $PREFIX/share/pinot/engines/Bozo.src "clowns"
Plugins are categorized by channels. For Sherlock plugins, the routeType
element under SEARCH specifies the name of the channel the plugin belongs to.
As for OpenSearch, Pinot should work with OpenSearch Description 1.0 and 1.1
(draft 2) plugins. Keep in mind that the spec doesn't describe how to parse
the results pages returned by search engines, therefore Pinot assumes that
engines return results formatted according to the OpenSearch Response
standard.
In practice, this means that plugins that don't stick to the following rules
will be ignored or won't show any result :
* For Description 1.1 plugins, the type attribute on the Url field must be
set to "application/atom+xml" or "application/rss+xml" (default).
"text/html" will be rejected.
* The search engine's results page content type must be some form of XML,
otherwise Pinot won't attempt parsing it.
Pinot differs from the Description spec in that it interprets the Tags field
as a channel name. The standard defines Tags as a "space-delimited set of
words that are used as keywords to identify and categorize this search
content".
The "Xapian Omega" plugin lets you query a locally installed instance of
Xapian Omega at http://localhost/. If Omega is installed elsewhere, edit
$PREFIX/share/pinot/engines/OmegaDescription.xml.
3. Indexes
Pinot has two internal indexes. My Documents is populated by the D-Bus
service and contains documents found on your computer. My Web Pages is
populated by the UI whenever you :
* import an external document, using the Index, Import URL menu
* index results returned by Web engines, using the Results, Index menu
or through a Stored Query
Both indexes may contain any of the file types listed in section "7. File formats".
Indexes built by any other Xapian-based tools can be added to Pinot. To add
an external index, click the + button at the bottom of the engines list.
It can either be local, in which case you will have to select the directory
where it is found, or served from a remote machine by xapian-tcpsrv. See
the manual page for xapian-tcpsrv(1).
All indexes are grouped together under the channel Current User in the
engines list.
4. Indexing and monitoring
Pinot can index any directory configured under the Indexing tab of the
Preferences box. Monitoring is optional and should be disabled for the
directories whose contents seldom change, eg $PREFIX/share/doc.
Indexing and monitoring of directories is handled by the D-Bus service.
The number of files and directories that can be monitored is capped by
the value of /proc/sys/fs/inotify/max_user_watches minus 1024.
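To get a rough idea of how many files and directories can be monitored on a
given system, the cap described above can be computed from the shell. This is
only a sketch of that arithmetic; watch_budget is a hypothetical helper name.

```shell
# Print the monitoring cap described above: max_user_watches minus 1024.
# watch_budget is a hypothetical helper, not part of Pinot.
watch_budget() {
    # The file to read can be overridden, which is mostly useful for testing.
    limit_file="${1:-/proc/sys/fs/inotify/max_user_watches}"
    limit=$(cat "$limit_file") || return 1
    echo $((limit - 1024))
}
```

Running "watch_budget" without arguments reads the kernel's current limit.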
Symlinks are not followed but are still indexed, with the MIME type
"inode/symlink".
While Pinot is not currently able to get to and index application-specific
data held in dot-directories, it can index common file formats as listed
in section "7. File formats".
All files and directories with a name that starts with a dot, eg
".thunderbird", are skipped and their content is not indexed. If you wish
to include the contents of some dot-directory, create a symlink to a
directory that is configured in Preferences. For instance, if "~/Documents"
is configured for indexing, create a symlink from "~/.thunderbird" to
"~/Documents/TMail". For this to work, the dot-directory must not be in a
directory configured for indexing.
If you want to exclude any specific files or directories from indexing, use
patterns as described in section "8. File patterns".
Pinot supports stopwords removal. While no stopwords lists are provided by
default, they can easily be found on the Internet. Each language has its own
list; for instance, a stopwords list for English should be copied to
$PREFIX/share/pinot/stopwords/stopwords.en
Language detection is done with libexttextcat. Ensure that the paths listed
in /etc/pinot/textcat_conf.txt are correct.
The pinot-index program allows indexing and peeking at documents' properties
from the command-line. Using the -i/--index option with the My Documents or
My Web Pages index is not recommended. For more details, see the manual page
for pinot-index(1).
5. Searching
Searches are run differently based on the type of engine being queried.
When querying a Web engine, Pinot assumes this engine understands the query,
which is sent as is. No pre-processing is performed on the text of the query,
and the results list is more or less presented as retrieved from the Web
engine.
When querying an index, things are somewhat different. Queries can be
expressed in a very natural way, using a combination of operators, filters
and ranges. This query syntax is the syntax supported natively by Xapian's
QueryParser and is documented at http://www.xapian.org/docs/queryparser.html
For instance, the query "type:text/html AND lang:en AND (tcp NEAR ip)" will
look for HTML files in English that mention TCP/IP. Note that all operators
should be specified in capitals, eg "AND" not "and". The latter will be
treated as a regular term.
Pinot supports these query filters :
"site" for host name, eg "site:github.com"
"file" for file name, eg "file:index.html"
"ext" for file extension, eg "ext:html"
"title" for title, eg "title:pinot"
"url" for URL, eg "url:https://github.com/"
"dir" for directory, eg "dir:/home/fabrice"
"inurl" for documents embedded in a URL, eg "inurl:file:///home/fabrice/Documents/backup.tar.gz"
"lang" for ISO language code, eg "lang:en"
"type" for MIME type, eg "type:text/html"
"class" for MIME type classification, eg "class:text"
"label" for label, eg "label:Important"
The directory filter is recursive, ie it applies to sub-directories.
Allowed language codes are "da", "nl", "en", "fi", "fr", "de", "hu", "it",
"nn", "pt", "ro", "ru", "es", "sv" and "tr".
Stemming is available to stored queries for which a stemming language is
defined. If such a query doesn't return any exact match, the query terms are
stemmed and the query is run again. Stopwords are also then removed if a
stopwords list was found for the stemming language.
The values of "file", "url", "dir" and "label" may be double-quoted. It's also
worth pointing out that the query "dir:/X/Y" will return files and directories
located in /X/Y, but not Y itself, which is what "dir:/X file:Y" would do.
In addition, these ranges are supported :
"YYYYMMDD..YYYYMMDD" for date ranges, eg "20070801..20070831"
"HHMMSS..HHMMSS" for time ranges, eg "090000..180000"
"size0..size1b" for size in bytes, eg "0..10240b"
See the manual page for pinot-search(1) for examples.
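When scripting queries against an index, it can help to assemble the filters,
ranges and terms described in this section into a single query string. The
sketch below only builds that string; join_and is a hypothetical helper name,
and how the result is passed to pinot-search is left to the manual page.

```shell
# Join all arguments with the AND operator, in capitals as required.
# join_and is a hypothetical helper, not part of Pinot.
join_and() {
    query=""
    for part in "$@"; do
        if [ -z "$query" ]; then
            query="$part"
        else
            query="$query AND $part"
        fi
    done
    printf '%s\n' "$query"
}
```

For instance, join_and type:text/html lang:en "(tcp NEAR ip)" prints the
example query given earlier in this section.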
6. Viewing cached results
Results returned by search engines can be viewed "live" by selecting the View
menuitem under Results. This opens whatever application is defined for the
result's MIME type and/or protocol scheme.
In addition, Pinot lets you view the page as cached by Google and the Wayback
Machine. Cache providers are actually configured in globalconfig.xml, located
in /etc/pinot/. For instance :
<cache>
<name>Google</name>
<location>http://www.google.com/search?q=cache:%url0</location>
<protocols>http, https</protocols>
</cache>
This is self-explanatory :-) Here it configures a cache provider called
"Google" that handles both http and https. The location field supports
two parameters that are substituted to obtain the URL to open :
* %url is the result's URL as displayed by the UI, eg
https://github.com/FabriceColin/pinot
* %url0 is the result's URL without the protocol, eg
github.com/FabriceColin/pinot
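The substitution of the two parameters can be illustrated from the shell. This
is a sketch of the behaviour described above, not Pinot's actual code;
expand_cache_location and escape_replacement are hypothetical helper names,
and whether Pinot encodes the URL in any way is not covered here.

```shell
# Expand a cache provider location template for a given result URL.
# Both functions are hypothetical helpers, not part of Pinot.
escape_replacement() {
    # Escape the characters that are special in a sed replacement
    # when '|' is the delimiter: backslash, ampersand and pipe.
    printf '%s' "$1" | sed -e 's/[\\&|]/\\&/g'
}

expand_cache_location() {
    template="$1"
    url="$2"
    # %url0 is the URL without its protocol, eg without "https://".
    url0="${url#*://}"
    # Substitute %url0 first because %url is a prefix of %url0.
    printf '%s\n' "$template" | sed \
        -e "s|%url0|$(escape_replacement "$url0")|g" \
        -e "s|%url|$(escape_replacement "$url")|g"
}
```

Called with the location of the "Google" provider above and the URL
https://github.com/FabriceColin/pinot, this prints
http://www.google.com/search?q=cache:github.com/FabriceColin/pinot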
7. File formats
The following document types are supported internally :
* plain text
* HTML
* XML
* mbox, including attachments and embedded documents
* MP3, Ogg Vorbis, FLAC
* JPEG
* common archive formats (tar, Z, gz, bzip2, deb)
* ISO 9660 images
The following document types are supported through external programs :
* PDF (pdftotext required)
* RTF (unrtf required)
* ReStructured Text (rst2html required)
* OpenDocument/StarOffice files (unzip required)
* MS Word (antiword required)
* PowerPoint (catppt required)
* Excel (xls2csv required)
* DVI (catdvi required)
* DjVu (djvutxt required)
* RPM (rpm required)
For other document types, Pinot will only index metadata such as name and
location. If you wish to add support for another document type, and
know of a command-line program that can handle that type, add it to
external-filters.xml, located in /etc/pinot/.
8. File patterns
It is possible to skip indexing of files that match glob(3) patterns.
These patterns are configured in the Indexing tab of the Preferences box,
and can be used as a blacklist or a whitelist.
Patterns apply to files and directories. For instance, blacklisting
"*/Desktop*" will skip "~/Desktop" and not crawl nor monitor this directory's
contents. Similarly, a blacklist entry for "*.avi" means that Pinot will not
attempt indexing the content of AVI files, and will ignore all monitor events
related to these files.
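A pattern can be tried out against a few paths before it is added to the
list. The sketch below uses the shell's own pattern matching, which is assumed
to be close enough to glob(3) for simple patterns such as the ones above; it
may differ from what Pinot does in corner cases. matches_pattern is a
hypothetical helper name.

```shell
# Return success if the path in $1 matches the glob pattern in $2.
# matches_pattern is a hypothetical helper, not part of Pinot.
matches_pattern() {
    case "$1" in
        # $2 is deliberately left unquoted so that it acts as a pattern.
        $2) return 0 ;;
        *) return 1 ;;
    esac
}
```

For instance, matches_pattern /home/fabrice/Desktop "*/Desktop*" succeeds,
while matches_pattern /home/fabrice/notes.txt "*.avi" fails.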
If you have never run Pinot before, the list will be pre-configured to skip
some picture, video and archive file types such as GIF, MPG and RAR.
9. Digging deeper
Pinot offers two ways you can dig deeper in your documents : More Like This
suggests terms specific to documents that may help in finding related
documents, and Search This For lets you search within results.
Both features are enabled if one or more of the results currently selected
is indexed, and only operate on those.
When activated, More Like This will create a new Stored Query prefixed with
"More Like". For instance, if you run a Stored Query with name "Me", the
expanded query's name will be "More Like Me".
Search This For will search those results for the Stored Query selected in
the sub-menu and will present results in a new tab. For instance, running
the Stored Query "Me" on a set of results will open a "Me In Results" tab.
In addition to these, Pinot may suggest alternative spellings for queries
that don't return any result. If it does, a new Stored Query prefixed with
"Corrected" will be created.
10. Saving results
Lists of results can be saved to disk by selecting the Save As menuitem
under Results. Two output formats are available to choose from in the file
selector opened by Save As :
* CSV, a text format
The semi-colon character (';') is used to delimit fields.
* OpenSearch response, an XML/RSS format
See https://en.wikipedia.org/wiki/OpenSearch for details.
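Results saved as CSV can be post-processed with standard tools, using the
semi-colon as delimiter. The sketch below extracts one column; csv_field is a
hypothetical helper name, and since the order of columns is not described
here, look at a saved file first to find which column holds what.

```shell
# Print one field of every row of a results list saved as CSV.
# csv_field is a hypothetical helper; it assumes that fields do not
# themselves contain semi-colons, which may not always hold.
csv_field() {
    field_number="$1"
    csv_file="$2"
    cut -d';' -f"$field_number" "$csv_file"
}
```

For instance, csv_field 2 results.csv prints the second column of a file
saved as results.csv.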
11. D-Bus service & daemon
Unless Pinot was built without support for D-Bus, the daemon program
"pinot-dbus-daemon" implements the D-Bus service and should be
auto-started through the desktop file installed at
/etc/xdg/autostart/pinot-dbus-daemon.desktop.
D-Bus activation makes sure the service is running whenever one of its
methods is invoked by any consumer application. For instance, clicking
OK on the Preferences box will call the service's Reload method, which
should start the service. This method also causes the service to reload
the configuration file.
A few things to keep in mind :
* when starting, the service will first crawl all configured locations
and (re)index new and modified files. The daemon's scheduling priority
is set very low (15, can be adjusted with --priority) so that it
hopefully doesn't prevent other activities. Crawling is suspended
while the system is on battery.
* when finished crawling, the service will monitor some locations for
changes (as per preferences) and should consume little resources, unless
a huge quantity of files needs its attention.
* any change detected by the monitor is queued and acted upon as soon as
possible, eg reindex a file that was modified.
* operations that involve communicating with the service, such as editing
document metadata, may time out if the system is under heavy load and/or
the daemon is busy. In most cases, the message will have been received
by the daemon, but the reply may take longer than expected. The Pinot
UI may report that the operation failed, even though it was queued for
processing and will be acted upon by the daemon.
See section "13. Environment variables and aliases" for some tips on how to
query the D-Bus interface. A list of available D-Bus methods can be found
in the file pinot-dbus-daemon.xml.
Pinot v1.20 implements the GNOME Shell search provider interface to allow
searching the contents of files the daemon found at locations it crawled,
basically the My Documents index. Go to the GNOME Settings' Search screen
to enable Pinot as a provider. For this to work, the file
com.github.fabricecolin.Pinot.search-provider.ini should be in the folder
$PREFIX/share/gnome-shell/search-providers/
12. CJKV support
Pinot supports indexing and searching CJKV text.
At search time, queries that include CJKV characters are processed in a manner
compatible with the CJKV indexing scheme. There is no need to format the query
in a specific format, ie no need to separate characters with spaces.
For example, the query :
Fabrice 你好 title:身体好吗
will be modified internally to :
Fabrice (你 你好 好) title:身 title:身体 title:体 title:体好 title:好 title:好吗 title:吗
It is recommended that filters (eg "title") be used at the end of the query
for it to be processed as expected.
You can get a list of documents in which CJKV characters were detected
by the indexer with the special filter "tokens:CJKV".
13. Environment variables and aliases
Pinot tries to provide reasonable defaults for most systems, but there may be
situations where you want to tweak these values through environment variables :
* PINOT_SPELLING_DB
By default, Pinot builds indexes with a spelling database. This spelling
database may make up as much as a third of the size of the index.
If your system is low on disk space, you can disable this with
$ export PINOT_SPELLING_DB=NO
Make sure this is set for your login session, ie whenever the daemon is
auto-started. You will also have to reset indexes, as described in
section "14. How to reset indexes".
* PINOT_MINIMUM_DISK_SPACE
The daemon will stop crawling and indexing files when the partition on
which the index resides runs out of free space. By default, this means
less than 50 Mb. To change this value to 100 Mb for instance, use
$ export PINOT_MINIMUM_DISK_SPACE=100
* PINOT_MAXIMUM_INDEX_THREADS
This sets the maximum number of concurrent indexing threads used by the
daemon. The default value is 1.
* PINOT_MAXIMUM_NESTED_SIZE
This limits the extraction of documents nested inside others, such as
archives or mail messages, based on their size. By default, this is
deactivated and set to 0.
* PINOT_MAXIMUM_QUERY_RESULTS
This overrides the number of results returned by queries run through
the UI's Query field as well as the number of results initially set
for new stored queries.
Another environment variable that you may want to tweak comes from Xapian.
XAPIAN_FLUSH_THRESHOLD can be set to the number of documents after which
Xapian is to flush changes to the index. The default value is set to 10000
at the time of writing this.
Lowering this value should decrease the amount of memory used to cache
changes to the index.
Pinot provides a "tagged cd" script that lets you change a shell's
current directory to the directory that matches the path elements passed
as parameters. For instance, after setting :
$ alias pcd='. $PREFIX/share/pinot/pinot-cd.sh'
if ~/Documents is configured for indexing in Preferences, the following
command would change the current directory to ~/Documents/Web/Stats :
$ pcd Documents Stats
If other directories match the given paths, pinot-cd.sh will display a list
of matches. Future work will focus on disambiguation.
If you have dbus-send installed, you may also want to set the following
aliases :
$ alias pinot-stats='dbus-send --session --print-reply --type=method_call \
--dest=com.github.fabricecolin.Pinot /com/github/fabricecolin/Pinot com.github.fabricecolin.Pinot.GetStatistics'
$ alias pinot-stop='dbus-send --session --print-reply --type=method_call \
--dest=com.github.fabricecolin.Pinot /com/github/fabricecolin/Pinot com.github.fabricecolin.Pinot.Stop'
The first will start the service daemon by calling its GetStatistics method,
while the second alias will send it a request to stop and exit.
14. How to reset indexes
You may wish to reset one of the indexes and start from scratch. There
are several ways to do this, depending on which index it is.
If you want to reset My Web Pages, you can either :
* use Pinot to unindex every single document by selecting them all
and choosing Unindex in the Index menu
* or stop Pinot and delete ~/.pinot/index recursively
If you want to reset My Documents, special considerations apply because
of the historical data maintained by the daemon. There are two ways to
proceed, and both require that the daemon be stopped.
The manual way is to delete the index with
$ rm -rf ~/.pinot/daemon
and remove historical data with
$ sqlite3 ~/.pinot/history-daemon "delete from CrawlHistory; delete from CrawlSources; delete from ActionQueue;"
If you want to start from scratch and drop metadata (eg labels) that may
exist on some documents, remove the history file altogether with
$ rm -f ~/.pinot/history-daemon
The automated way is to tell the daemon to reindex everything by launching
it with the "--reindex" option, ie
$ pinot-dbus-daemon --reindex
It may be useful to take a look at the log file located at
~/.pinot/pinot-dbus-daemon.log.
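The manual way of resetting My Documents, in its "start from scratch" variant
that also drops metadata, can be wrapped in a small function. This is a sketch
only; reset_my_documents is a hypothetical helper name, and it must be run
while the daemon is stopped, as explained above.

```shell
# Delete the My Documents index and the daemon's history file.
# reset_my_documents is a hypothetical helper, not part of Pinot.
# Labels and other metadata kept in the history file are lost.
reset_my_documents() {
    # The Pinot directory can be overridden and defaults to ~/.pinot.
    pinot_dir="${1:-$HOME/.pinot}"
    # ${pinot_dir:?} aborts if the variable is empty, as a safety net.
    rm -rf "${pinot_dir:?}/daemon"
    rm -f "${pinot_dir:?}/history-daemon"
}
```

The My Web Pages index, stored in ~/.pinot/index, is left untouched.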
15. Compiling
Pinot's configure understands the following optional switches.
--enable-debug        enable debug [default=no]
--enable-dbus         enable DBus support [default=yes]
--enable-libnotify    enable libnotify support [default=no]
--enable-mempool      enable memory pool [default=no]
--enable-libarchive   enable the libarchive filter [default=no]
--enable-chmlib       enable the chmlib filter [default=no]
Enable support for libarchive and chmlib if the necessary
libraries are available. Enable libnotify support when building
on BSD systems. Other switches should most likely stay unchanged.
In order to build from the git repository, follow these steps:
$ git clone https://github.com/FabriceColin/pinot.git
$ cd pinot
$ touch ChangeLog
$ ./autogen.sh --prefix=/usr --libdir=/usr/lib64 --sysconfdir=/etc \
--enable-debug=yes --enable-libarchive=yes --enable-chmlib=yes
See the list below for dependencies. The version numbers indicate
the minimum version Pinot has been tested with; older versions may
or may not work.
---------------------------------------------------------------
Libraries and tools             Version
---------------------------------------------------------------
SQLite                          3.3.1
  http://www.sqlite.org/
xapian-core                     1.4.10
  http://www.xapian.org/
zlib                            1.2.0
  http://www.gzip.org/zlib/
curl (1)                        7.13.1
  http://curl.haxx.se/
- OR -
neon (1)                        0.24.7
  http://www.webdav.org/neon/
gdbus-codegen-glibmm (2)
  https://github.com/Pelagicore/gdbus-codegen-glibmm
gtkmm                           3.24
  http://www.gtkmm.org/
libxml++                        2.12.0
  http://libxmlplusplus.sourceforge.net/
libexttextcat                   3.2
  http://cgit.freedesktop.org/libreoffice/libexttextcat/
gmime (3)                       2.6.0
  http://spruce.sourceforge.net/gmime
boost (4)                       1.75
  http://www.boost.org/
D-Bus with GLib bindings        0.61
  http://www.freedesktop.org/wiki/Software/dbus
shared-mime-info                0.17
  http://freedesktop.org/Software/shared-mime-info
desktop-file-utils              0.10
  http://www.freedesktop.org/software/desktop-file-utils
TagLib                          1.4
  http://ktown.kde.org/~wheeler/taglib/
libarchive (5)                  3.0.0
  https://libarchive.org/
exiv2                           0.21
  http://www.exiv2.org/
chmlib (6)                      0.40
  http://www.jedrea.com/chmlib/
openssh-askpass (7)             4.3
  http://www.openssh.com/portable.html
---------------------------------------------------------------
External filter programs
---------------------------------------------------------------
unzip
  http://www.info-zip.org/pub/infozip/UnZip.html
pdftotext
  http://www.foolabs.com/xpdf/
  http://poppler.freedesktop.org/
antiword
  http://www.winfield.demon.nl/
unrtf
  http://www.gnu.org/software/unrtf/unrtf.html
rst2html
  http://docutils.sourceforge.net
djvutxt
  http://djvu.sourceforge.net/
catdvi
  http://catdvi.sourceforge.net/
catppt
xls2csv
  http://www.wagner.pp.ru/~vitus/software/catdoc/
---------------------------------------------------------------------
Notes :
(1) enabled with "./configure --with-http=neon|curl"
(2) only to regenerate DBus code, with "make dbus-code"
(3) for gmime 2.4.0 support, edit configure.in
(4) for building only
with boost > 1.48 and < 1.54, turning off memory pooling with "./configure --enable-mempool=no" may be preferable
(5) optional - enabled with "./configure --enable-libarchive=yes"
(6) optional - enabled with "./configure --enable-chmlib=yes"
(7) experimental - required only if _SSH_TUNNEL is set
---------------------------------------------------------------------