
README for tagcoll 1.5.2, still unreleased
==========================================
Functionality
-------------
A tagged collection is a set of items in which each item is associated with a
set of zero or more tags, in no particular order.
Or, more briefly, a collection of things with categories attached.
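The definition above can be sketched in a few lines of C++. This is only an illustration: the names ``Item``, ``Tag``, ``TagSet``, ``TaggedCollection``, ``addItem`` and ``itemsWithTag`` are hypothetical and are not part of the libtagcoll API.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Illustrative names only; none of these are libtagcoll types.
using Item = std::string;
using Tag = std::string;
using TagSet = std::set<Tag>;                     // a set: no duplicates, no meaningful order
using TaggedCollection = std::map<Item, TagSet>;  // an item may have zero tags

// Attach tags to an item, creating the item if it is not yet present.
void addItem(TaggedCollection& coll, const Item& item, const TagSet& tags = TagSet())
{
    assert(!item.empty());  // assumption of this sketch: every item has a name
    coll[item].insert(tags.begin(), tags.end());
}

// Return every item that carries the given tag.
std::set<Item> itemsWithTag(const TaggedCollection& coll, const Tag& tag)
{
    std::set<Item> result;
    for (const auto& entry : coll)
        if (entry.second.count(tag) != 0)
            result.insert(entry.first);
    return result;
}
```

Calling ``addItem(coll, "untagged")`` stores an item with an empty tag set, which matches the "zero or more tags" part of the definition.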
tagcoll performs operations on tagged collections. It was written to study
tagged collections and to experiment with tagged collection algorithms.
tagcoll comes with commandline help for a quick option summary and a complete
manpage with documentation for all the switches and the format of input and
output data.
The package finally builds some API documentation using doxygen. Not much of
the source code has useful comments yet, but since doxygen support is in
place, this situation should improve rapidly.
I'd still like to discuss design issues with interested people. Until the
library is used by a wider range of applications and proves to be sound
enough for general use, the API is subject to change.
Usage examples of libtagcoll, other than the ones in the tests and
documentation, can be found in the various tools that link this library, such
as tagcoll, taggrep, tagcolledit, libdebtags, debtags and debtags-edit.
Tagcoll and taggrep are especially suited as examples because most of their
complexity resides in the library itself and they use it in a very
straightforward way.
Please get in touch with me if you start using libtagcoll in some projects:
I would like to involve library users in design discussions and API changes.
What is in this release
-----------------------
Bumped major release since we aggressively changed the API.
TDBFile now has copy operations disabled, because it contains a TDB handle
which is unsafe to copy (copying leads to double closes and deallocations). I
might consider implementing proper copy operations in the future if I find a
need for them.
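The reason for disabling copies can be sketched as follows. This is not the TDBFile source: ``FakeHandle``, ``fakeOpen``, ``fakeClose`` and ``HandleOwner`` are hypothetical stand-ins for the TDB handle and its wrapper, and the sketch uses the C++11 ``= delete`` syntax, which may differ from how the library itself disables copying.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical C-style resource API standing in for the TDB library.
struct FakeHandle { int unused; };
static int openHandles = 0;
FakeHandle* fakeOpen() { ++openHandles; return new FakeHandle(); }
void fakeClose(FakeHandle* h) { --openHandles; delete h; }

// Hypothetical stand-in for TDBFile: owns exactly one handle.
class HandleOwner
{
    FakeHandle* handle;

public:
    HandleOwner() : handle(fakeOpen()) { assert(handle != nullptr); }
    ~HandleOwner() { fakeClose(handle); }

    // A copy would hold the same pointer, and both destructors would
    // close it: the double close described above.
    HandleOwner(const HandleOwner&) = delete;
    HandleOwner& operator=(const HandleOwner&) = delete;
};

// The compiler now rejects any attempt to copy the owner.
static_assert(!std::is_copy_constructible<HandleOwner>::value,
              "copy construction is disabled");
static_assert(!std::is_copy_assignable<HandleOwner>::value,
              "copy assignment is disabled");
```

With the copy operations deleted, a mistaken copy becomes a compile error instead of a double close at run time.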
Compiling from the subversion repository
----------------------------------------
In addition to the build-dependencies, you need to install ``libtool`` and
``automake1.9``::
apt-get install libtool automake1.9
./autogen.sh
./configure
make
The library uses GNU Libtool. While this is very nice, it can be a pain when
trying to debug a program. For that reason, compilation of shared libraries can
be turned off by specifying the ``--disable-shared`` option to ``configure``.
Resources
---------
Website:
http://debtags.alioth.debian.org
Mailing lists:
debtags-devel@lists.alioth.debian.org
http://lists.alioth.debian.org/mailman/listinfo/debtags-devel
Development and usage discussions
debtags-commits@lists.alioth.debian.org
http://lists.alioth.debian.org/mailman/listinfo/debtags-commits
Postings of subversion logs
Subversion repository:
http://svn.debian.org/viewcvs/debtags/libdebtags
Browse online
svn://svn.debian.org/debtags/libdebtags
Read-only access
svn+ssh://alioth.debian.org/svn/debtags/libdebtags
Write access
APT repository:
deb http://debtags.alioth.debian.org/debian unstable main
Further stages of the development
---------------------------------
Development will now proceed by:
* Trying to find better data structures and algorithms to improve the
efficiency of the various features provided by debtags
* Possible restructurings to provide a cleaner API
TODO-list items completed so far
--------------------------------
These are the TODO-list items completed so far::
--- Done in 1.6.2
- Fixed alignment issues causing bus errors in some arches
- Compile tools before tests, so that tests/tagcoll-test can find
tools/tagcoll
--- Done in 1.6.1
- Compiles with g++ 4.1
--- Done in 1.6
- Added grep and items commands to tagcoll
- Added tagidx utility
- Implemented new on-disk index
- Implemented benchmarks
- Fixed the tests
- Now run the tests when building the package
--- Done in 1.5.2
+ Internal refactor of the tagcoll utility
+ Tagcoll can now parse multiple input files, and defaults to stdin if no
input file was specified
+ Added tagcoll grep and tagcoll items
--- Done in 1.5
+ Reverted the 2.0 bump, which wasn't really useful
+ Removed some attempts to preserve clashes, which really don't make sense in
the current state of things
+ Removed the DONE file: one can fish the old TODO-lists up from the README in
old versions in tag/something in the repository
--- Done in 2.0
+ Bumped major version to allow coexistence with old libtagcoll1-dev
--- Done in 1.4.1
+ Remove invalid items when loading a patch from disk
+ A bit more resilience against invalid items in PatchCollection and
TDBDiskIndex
+ Removed Filter's consumeItems* default methods, that caused unexpected
behaviour on implementing classes
--- Done in 1.4
+ Refactored the library: saner names, no conflicting names for virtual
functions
+ Removed TagcollReverser: the functionality has been integrated into
ItemGrouper and, to a lesser extent, into InputMerger
+ Redesigned the serializers
+ TDBDiskIndex: use two on-disk indexes: one for tag->item and one for
item->tag (done a while ago, but not listed as done until now)
= Report a wishlist bug to libtool asking for a feature to generate PIC
libraries
+ Document header files and main classes, with usage examples rather than long
explanations
+ Add more tests
--- Done in 1.3
+ Merged source packages tagcoll and libtagcoll1
--- Done in 1.2.1
+ Implemented --remove-unfaceted option
+ Implemented --remove-tags option
--- Done in 1.0.7
+ Added ExpressionTagRemover
--- Fixed in 1.0.6
+ Generalised ExpressionFilter on the tag part
--- Fixed in 1.0.5
+ Transition to gcc4
(older completed items are archived in the DONE file)
TODO-list items being worked on
-------------------------------
These are the TODO-list items currently being worked on::
- Make tagidx usable to power the central database on Alioth:
+ mmap-based fast index
+ patch directory
- Change the 'related' function to return a vector of itemsets, one for every
distance
- Experiment with different 'related' algorithms: the current one is Hamming
distance, but it could also be 'number of common tags', which would also be
more efficient to compute
- Create more IntIndex specific optimized methods instead of using the default
ones
- Read http://people.lis.uiuc.edu/~twidale/irinterfaces/4othersystems.html
- Replace ParserInputs with C++ iostreams
- Two navigational modes:
- Tree (navigation through refining, with tree generated by smart hierarchy
and normalization)
- Graph (every tagset is a node, every addition or removal of a tag is an
arc to another tagset). It can be navigated using a hyperbolic graph,
but a prototype can be done by generating one graphviz graph per tagset
with all nodes at distance 2, possibly generating a hyperlinked output
like doxygen does
- New grouping algorithm:
- Define a group cardinality minimum threshold (kind of around 7) and
maximum threshold (kind of around 14)
- Identify all tagsets with cardinality > maximum threshold, and consider
them immutable
- Identify all tagsets with cardinality < minimum threshold, and merge them
with the nearest tagsets so that the cardinality of the resulting set is
still < maximum threshold. Merge could happen only among tagsets at
distance 1, or one could have hints to give weight to tags, and compute a
weighted distance that considers the relevance to the user of the various
different tags (for some users, a change in implemented-in::* could mean
near to nothing).
+ Use a collection of Hints for having a preference for the tags of the
resulting set (for example, implemented-in::* could be less important
than use::*) or just use the merge of all tags in the merged tagsets as
the resulting tagset, or use the intersection, or handle merged tagsets
specially. The intersection is probably better, especially if weighted
distance is used before and then the tags cut out by the intersection
would be the less relevant ones already.
- Classes could be templatized on a 'distance' functor, or passed the
distance functor
- If small groups remain which can't be merged because all the nearby
groups are big, merge all of them who are smaller than an extra minimum
(say 2 or 3) anyway, using the nearest set regardless of its cardinality
- Try to run a smart hierarchy on the results
- Hints could be a map of Expression -> weight, so that multiple tags can
be assigned the same weight (like use::* -> 10, implemented-in::*->1)
This should normalise the 'special' items somehow.
- Create an HTML-browser generator using the normalized collection
- Merge ItemGrouper and TDBIndexer
- Add example code
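The two 'related' items in the list above (Hamming distance versus number of common tags, and returning one itemset per distance) can be sketched as follows. This is a sketch under assumptions, not the library's implementation: ``TagSet``, ``Collection``, ``commonTags``, ``hammingDistance`` and ``relatedByDistance`` are hypothetical names, and the Hamming distance between two tagsets is taken to be the number of tags present in exactly one of them.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <map>
#include <set>
#include <string>
#include <vector>

using TagSet = std::set<std::string>;
using Collection = std::map<std::string, TagSet>;  // item -> tags

// Number of tags present in both sets.
std::size_t commonTags(const TagSet& a, const TagSet& b)
{
    std::vector<std::string> common;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::back_inserter(common));
    return common.size();
}

// Number of tags present in exactly one of the two sets.
std::size_t hammingDistance(const TagSet& a, const TagSet& b)
{
    std::size_t common = commonTags(a, b);
    assert(common <= a.size() && common <= b.size());
    return a.size() + b.size() - 2 * common;
}

// Group the items other than `start` by their distance from it:
// result[d] holds the items at distance d, for d <= maxDistance.
std::vector<std::set<std::string>> relatedByDistance(
    const Collection& coll, const std::string& start, std::size_t maxDistance)
{
    std::vector<std::set<std::string>> result(maxDistance + 1);
    Collection::const_iterator origin = coll.find(start);
    if (origin == coll.end())
        return result;  // unknown item: every itemset stays empty
    for (const auto& entry : coll) {
        if (entry.first == start)
            continue;
        std::size_t d = hammingDistance(origin->second, entry.second);
        if (d <= maxDistance)
            result[d].insert(entry.first);
    }
    return result;
}
```

In this sketch the distance is derived from the common-tag count, so switching ``relatedByDistance`` to rank by ``commonTags`` directly would only change the one line that computes ``d``.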
Future TODO-list items
----------------------
These are the TODO-list items that are to be addressed in the future::
- Do a facet filter, to only get in packages with given facets. It could be
used for example to prepare the data to then create a smart hierarchy using
only facets relevant to a given environment
- Have the related command accept a tagset, and not just a package name, as
the start of a search [ender]
- [herv] perform some cleanup before building the .deb
> The problem is, it takes a lot of time for me to delve into the
> automake info files to find out how to add to the distclean target.
> Do you have some hint on where to look at?
CLEANFILES = *~ .*~ *.bak *.org gmon.out core \#*\# .\#*
in .am files. You can add that in a include file, included by every
Makefile.am...
(it should be superseded by building on a fresh svn checkout)
- Replace Consumers with functors:
// PACKAGE, PACKAGES and TAGS are placeholder types
class Collection
{
public:
    void addUntagged(PACKAGE& p);
    void addTagged(PACKAGE& p, TAGS& t);
    void addUntaggedGroup(PACKAGES& p);
    void addTaggedGroup(PACKAGES& p, TAGS& t);
};
class CollectionDispatcher
{
    Collection& c;
public:
    explicit CollectionDispatcher(Collection& c) : c(c) {}
    // Each overload forwards to the matching Collection method
    void operator() (PACKAGE& p) { c.addUntagged(p); }
    void operator() (PACKAGE& p, TAGS& t) { c.addTagged(p, t); }
    void operator() (PACKAGES& p) { c.addUntaggedGroup(p); }
    void operator() (PACKAGES& p, TAGS& t) { c.addTaggedGroup(p, t); }
};
(maybe it doesn't make sense)
- Tagcoll1
- Tag class, a smart pointer, with a virtual TagImpl
- A TagImpl* Tag::impl() method, for those who need to access special
functions of the implementation
- Comparison done exclusively by tag name, so that Tags of different types
can be mixed
- Facet and everything else accordingly
- Tag represented as a URL: Organization/facet/tag or
Organization/Facet#tag
- Hierarchization algorithms:
If mergeequivalent is called on a child node of a tree node, the
resulting hierarchy might come out with a single root node. In that
case, the node previously selected is equivalent to the root.
Option 1: take the parent into account when merging
Option 2: self-expand lone child nodes
- Implement cleanup of the tagcoll with noise filtering (removing tags with
cardinality less than a given threshold) and merging of equivalent tags.
Implement as a method of TagCollection, that gives the new, cleaned
collection.
- Another optimization would be to remove intermediate hierarchy nodes which
contain a single child node and no elements.
It could just be implemented by merging tagsets with cardinality under a
given threshold into the nearest, bigger one. Like filtering
insignificant or "noise" tag data before building the structure.
The optimization must be kept optional, because it induces a loss of tag
information if the resulting tagged collection is used as the only way
to store the results, as with tagbk
- Document the tree-cleaning features:
- noise reduction
- merging of equivalent items
- flattenThreshold
- Implications can be seen as a simple form of derived tags: handling them in
that way could avoid the need of multiple
expand-implications/add-derived-tags/expand-implications cycles
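The noise filtering item in the list above (dropping tags whose cardinality is below a threshold and returning a new, cleaned collection) could look roughly like this. The names ``TagSet``, ``Collection`` and ``removeNoise`` are hypothetical, and the sketch is a free function over standard containers rather than the TagCollection method the item proposes. Merging of equivalent tags is not covered.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <string>

using TagSet = std::set<std::string>;
using Collection = std::map<std::string, TagSet>;  // item -> tags

// Return a copy of `coll` without the tags attached to fewer than
// `threshold` items. Items are kept even if they end up untagged.
Collection removeNoise(const Collection& coll, std::size_t threshold)
{
    // First pass: cardinality of each tag, i.e. how many items carry it.
    std::map<std::string, std::size_t> cardinality;
    for (const auto& entry : coll)
        for (const auto& tag : entry.second)
            ++cardinality[tag];

    // Second pass: copy each item with only the tags that pass the threshold.
    Collection cleaned;
    for (const auto& entry : coll) {
        TagSet& kept = cleaned[entry.first];
        for (const auto& tag : entry.second)
            if (cardinality[tag] >= threshold)
                kept.insert(tag);
    }
    assert(cleaned.size() == coll.size());  // no item is lost
    return cleaned;
}
```

Because the input is left untouched and a new collection is returned, the filtering stays optional, in line with the note above about the loss of tag information.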
Discarded TODO-list items
-------------------------
These TODO-list items have been discarded::
= Use a generic class instead of Tag
- Must be comparable
- Must have a name() method
- Must have a fullname() method
- Must have a facet() method
- Must be assignable
- Must be possible to construct the "invalid" entry using the
constructor with no parameters
(doesn't make sense anymore: Tag has been moved to libdebtags1)
= Use a generic class instead of Facet
- Must be comparable
- Must have a name() method
- Must have a tags() method
- Must be assignable
- Must be possible to construct the "invalid" entry using the
constructor with no parameters
(doesn't make sense anymore: Facet has been moved to libdebtags1)
= Use a generic class instead of Vocabulary
- Must have a getFacet(const string& name) method
- Must have a getTag(const string& name) method
(doesn't make sense anymore: Vocabulary has been moved to libdebtags1)
Useful resources
----------------
Author
------
Enrico Zini <enrico@debian.org>
Erich Schubert and Hervé Eychenne have contributed a great deal of feedback and
ideas in the early stages of development.