File: README

package info (click to toggle)
tagcoll2 2.0.14-1
links: PTS, VCS
area: main
in suites: jessie, jessie-kfreebsd
size: 2,524 kB
ctags: 1,624
sloc: sh: 11,709; cpp: 6,201; makefile: 195; ansic: 74; lex: 54; yacc: 27
file content (372 lines) | stat: -rw-r--r-- 13,546 bytes
parent folder | download | duplicates (8)
README for tagcoll 1.5.2, still unreleased
==========================================

Functionality
-------------

A tagged collection is a set of items in which each item is associated with a
set of zero or more tags, in no particular order.

Or, more briefly, a collection of things with categories attached.

tagcoll performs operations with tagged collections.  It has been written
with the purpose of studying tagged collections and experiment with tagged
collection algorithms.

tagcoll comes with commandline help for a quick option summary and a complete
manpage with documentation for all the switches and the format of input and
output data.
 
The package finally builds some API documentation using doxygen.  Not much of
the source code has useful comments yets, but since doxygen support is in
place, this situation is going to improve rapidly.

I'd still like to discuss design issues with interested people, and until when
the library will be used by a wider range of applications and will prove to be
sound enough for general use, the API is subject to changes.

Usage examples of libtagcoll other than the ones in the tests and
documentation, can be found in various tools that link this library, like
tagcoll, taggrep, tagcolledit, libdebtags, debtags and debtags-edit.

Tagcoll and taggrep are especially suited as examples because most of their
complexity resides in the library itself and they use it in a very
straightforward way.

Please get in touch with me if you start using libtagcoll in some projects:
I would like to involve library users in design discussions and API changes.


What is in this release
-----------------------

Bumped major release since we aggressively changed the API.

TDBFile now copy operations disabled, because it contains a TDB handle which is
unsafe to copy (it leads to double closes and deallocs).  I might consider, in
the future, to implement proper copy operations if I find a need for it.



Compiling from the subversion repository
----------------------------------------

Additionally to the build-dependencies, you need to install ``libtool`` and
``automake1.9`::

  apt-get install libtool automake1.9
  ./autogen.sh
  ./configure
  make

The library uses GNU Libtool.  While this is very nice, it can be a pain when
trying to debug a program. For that reason, compilation of shared libraries can
be turned off by specifying the ``--disable-shared`` option to ``configure``.


Resources
---------

  Website:

    http://debtags.alioth.debian.org

  Mailing lists:
  
    debtags-devel@lists.alioth.debian.org
    http://lists.alioth.debian.org/mailman/listinfo/debtags-devel
      Development and usage discussions
  
    debtags-commits@lists.alioth.debian.org
    http://lists.alioth.debian.org/mailman/listinfo/debtags-commmits
      Postings of subversion logs

  Subversion repository:

    http://svn.debian.org/viewcvs/debtags/libdebtags
      Browse online
    svn://svn.debian.org/debtags/libdebtags
      Read-only access
    svn+ssh://alioth.debian.org/svn/debtags/libdebtags
      Write access

  APT repository:

    deb http://debtags.alioth.debian.org/debian unstable main


Further stages of the development
---------------------------------

Development will now proceed by:

 * Trying to find better data structure and algorithms to improve the
   efficiency of the various features provided by debtags
 * Possible restructurings to provide a cleaner API



TODO-list items completed so far
--------------------------------

These are the TODO-list items completed so far::

--- Done in 1.6.2
 - Fixed alignment issues causing bus errors in some arches
 - Compile tools before tests, so that tests/tagcoll-test can find
   tools/tagcoll

--- Done in 1.6.1
 - Compiles with g++ 4.1

--- Done in 1.6
 - Added grep and items commands to tagcoll
 - Added tagidx utility
 - Implemented new on-disk index
 - Implemented benchmarks
 - Fixed the tests
 - Now run the tests when building the package

--- Done in 1.5.2
 + Internal refactor of the tagcoll utility
 + Tagcoll can now parse multiple input files, and defaults to stdin if no
   input file was specified
 + Added tagcoll grep and tagcoll items

--- Done in 1.5
 + Reverted the 2.0 bump, which wasn't really useful
 + Removed some attempts to preserve clashes, which really don't make sense in
   the current state of things
 + Removed the DONE file: one can fish the old TODO-lists up from the README in
   old versions in tag/something in the repository

--- Done in 2.0
 + Bumped major version to allow coexistance with old libtagcoll1-dev

--- Done in 1.4.1
 + Remove invalid items when loading a patch from disk
 + A bit more resilience against invalid items in PatchCollection and
   TDBDiskIndex
 + Removed Filter's consumeItems* default methods, that caused unexpected
   behaviour on implementing classes

--- Done in 1.4
 + Refactored the library: saner names, no conflicting names for virtual
   functions
 + Removed TagcollReverser: the functionality has been integrated into
   ItemGrouper and, to a lesser extent, into InputMerger
 + Redesigned the serializers
 + TDBDiskIndex: use two on-disk indexes: one for tag->item and one for
   item->tag (done since a while, didn't get in the done part until now)
 = Report a wishlist bug to libtool asking for a feature to generate PIC
   libraries
 + Document header files and main classes, with usage examples rather than long
   explanations
 + Add more tests

--- Done in 1.3
 + Merged source packages tagcoll and libtagcoll1
 
--- Done in 1.2.1
 + Implemented --remove-unfaceted option
 + Implemented --remove-tags option

--- Done in 1.0.7
 + Added ExpressionTagRemover
 
--- Fixed in 1.0.6
 + Generalised ExpressionFilter on the tag part

--- Fixed in 1.0.5
 + Transition to gcc4

(older completed items are archived in the DONE file)



TODO-list items being worked on
-------------------------------

These are the TODO-list items currently being worked on::

 - Make tagidx usable to power the central database on Alioth:
    + mmap-based fast index
    + patch directory

 - Change the 'related' function to return a vector of itemsets, one for every
   distance
 - Experiment with different 'related' algorithms: the current one is hamming
   distance, but it could also me 'number of common tags', which would also be
   more efficient to compute

 - Create more IntIndex specific optimized methods instead of using the default
   ones

 - Read http://people.lis.uiuc.edu/~twidale/irinterfaces/4othersystems.html

 - Replace ParserInputs with C++ isostreams

 - Two navigational modes:
    - Tree (navigation through refining, with tree generated by smart hiearchy
      and normalization)
    - Graph (every tagset is a node, every addition of removal of a tag the
      arch to another tagset.  It can be navigated using an hyperbolic graph,
      but a prototype can be done by generating one graphwiz graph per tagset
      with all nodes at distance 2, possibly generating an hyperlinked output
      like doxygen does

 - New grouping algorithm:
    - Define a group cardinality minimum threshold (kind of around 7) and
      maximum threshold (kind of around 14)
    - Identify all tagsets with cardinality > maximum threshold, and consider
      them immutable
    - Identify all tagsets with cardinality < minimum threshold, and merge them
      with the nearest tagsets so that the cardinality of the resulting set is
      still < maximum threshold.  Merge could happen only among tagsets at
      distance 1, or one could have hints to give weight to tags, and compute a
      weighted distance that considers the relevance to the user of the various
      different tags (for some users, a change in implemented-in::* could mean
      near to nothing).
    + Use a collection of Hints for having a preference for the tags of the
      resulting set (for example, implemented-in::* could be less important
      than use::*) or just use the merge of all tags in the merged tagsets as
      the resulting tagset, or use the intersection, or handle merged tagsets
      specially.  The intersection is probably better, especially if weighted
      distance is used before and then the tags cut out by the intesection
      would be the less relevant ones already.
       - Classes could be templatized on a 'distance' functor, or passed the
	 distance functor
    - If small groups remain which can't be merged because all the nearby
      groups are big, merge all of them who are smaller than an extra minimum
      (say 2 or 3) anyway, using the nearest set regardless of its cardinality
    - Try to run a smart hierarchy on the results
    - Hints could be a map of Expression -> weight, so that multiple tags can
      be assigned the same weight (like use::* -> 10, implemented-in::*->1)
   This should normalise the 'special' items somehow.

 - Create an HTML-browser generator using the normalized collection

 - Merge ItemGrouper and TDBIndexer
   
 - Add example code


Future TODO-list items
----------------------

These are the TODO-list items that are to be addressed in the future::

 - Do a facet filter, to only get in packages with given facets.  It could be
   used for example to prepare the data to then create a smart hierarchy using
   only facets relevant to a given environment
 - Have the related command accept a tagset, and not just a package name, as
   the start of a search [ender]
 - [herv] perform some cleanup before building the .deb
   > The problem is, it takes a lot of time for me to dwelve into the
   > automake info files to find out how to add to the distclean target.
   > Do you have some hint on where to look at?
   CLEANFILES = *~ .*~ *.bak *.org gmon.out core \#*\# .\#*
   in .am files. You can add that in a include file, included by every  
   Makefile.am...
   (it should be superseded by building on a fresh svn checkout)
 - Replace Consumers with functors:
   class Collection
	{
	  addUntagged(PACKAGE& p)
	  addTagged(PACKAGE& p, TAGS& t)
	  addUntaggedGroup(PACKAGES* p)
	  addTaggedGroup(PACKAGEs& p, TAGS& t)
	}
	class CollectionDispatcher
	{
	  Collection& c;
	  public:
	  void operator() (PACKAGE& p) { c.addUntagged(p); }
	  void operator() (PACKAGE& p, TAGS& t) { c.addTagged(p, t); }
	  void operator() (PACKAGES& p) { c.addUntaggedGroup(p); }
	  void operator() (PACKAGES& pm TAGS& t) { c.addTaggedGroup(p, t); }
	}
   (maybe it doesn't make sense)
	    
 - Tagcoll1
    - Classe Tag, smart pointer, con TagImpl virtuale
       - Metodo TagImpl* Tag::impl impl(); da usare per chi ha bisogno di
      	accedere a funzioni speciali dell'impl
       - Compare fatto esclusivamente per nome del tag, cos da rendere
      	mescolabili Tag di tipo diverso
       - Facet e quant'altro di conseguenza
    - Tag rappresentato con un'URL: Organization/facet/tag o
      Organization/Facet#tag

 - Hierarchization algorithms:
   If mergeequivalent is called on a child node of a tree node, the
   resulted hiearchy might come out with a single root node.  In that
   case, the node previously selected is equivalent to the root.
   Option 1: take the parent into account when merging
   Option 2: self-expand lone child nodes

 - Implement cleanup of the tagcoll with noise filtering (removing tags with
   cardinality less than a given threshold) and merging of equivalent tags.
   Implement as a method of TagCollection, that gives the new, cleaned
   collection.

    - Another optimization would be to remove intermediate hierarchy nodes wich
      contain a single child node and no elements.

      It could just be implemented by merging tagsets with cardinality under a
      given threshold into the nearest, bigger one.  Like filtering
      insignificant or "noise" tag data before building the structure.

      The optimization must be kept optional, because it induces a loss of tag
      informations if the resulting tagged collection is used as the only way
      to store the results like with tagbk

 - Document the tree-cleaning features:
    - noise reduction
    - merging of equivalent items
    - flattenThreshold

 - Implications can be seen as a simple form of derived tags: handling them in
   that way could avoid the need of multiple
   expand-implications/add-derived-tags/expand-implications cycles


Discarded TODO-list items
-------------------------

These TODO-list items have been discarded::

 = Use a generic class instead of Tag
    - Must be comparable
       - Must have a name() method
       - Must have a fullname() method
       - Must have a facet() method
       - Must be assignable
       - Must be possible to construct the "invalid" entry using the
	 constructor with no parameters
   (doesn't make sense anymore: Tag has been moved to libdebtags1)
 = Use a generic class instead of Facet
    - Must be comparable
       - Must have a name() method
       - Must have a tags() method
       - Must be assignable
       - Must be possible to construct the "invalid" entry using the
	 constructor with no parameters
   (doesn't make sense anymore: Facet has been moved to libdebtags1)
 = Use a generic class instead of Vocabulary
    - Must have a getFacet(const string& name) method
    - Must have a getTag(const string& name) method
   (doesn't make sense anymore: Vocabulary has been moved to libdebtags1)


Useful resources
----------------


Author
------

Enrico Zini <enrico@debian.org>

Erich Schubert and Herv Eychenne have contributed a great deal of feedback and
ideas in the early stages of development.