#!/usr/bin/python3
################################################################################
#
#       This file is part of the General Hidden Markov Model Library,
#       GHMM version __VERSION__, see http://ghmm.org
#
#       file:    ghmm.py
#       authors: Benjamin Georgi, Wasinee Rungsarityotin, Alexander Schliep,
#                Janne Grunau
#
#       Copyright (C) 1998-2004 Alexander Schliep
#       Copyright (C) 1998-2001 ZAIK/ZPR, Universitaet zu Koeln
#       Copyright (C) 2002-2004 Max-Planck-Institut fuer Molekulare Genetik,
#                               Berlin
#
#       Contact: schliep@ghmm.org
#
#       This library is free software; you can redistribute it and/or
#       modify it under the terms of the GNU Library General Public
#       License as published by the Free Software Foundation; either
#       version 2 of the License, or (at your option) any later version.
#
#       This library is distributed in the hope that it will be useful,
#       but WITHOUT ANY WARRANTY; without even the implied warranty of
#       MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
#       Library General Public License for more details.
#
#       You should have received a copy of the GNU Library General Public
#       License along with this library; if not, write to the Free Software
#       Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
#
#
#
################################################################################

"""@mainpage GHMM - an open source library for Hidden Markov Models (HMM)

HMMs are stochastic models which encode a probability density over
sequences of symbols. These symbols can be discrete letters (A,C,G and
T for DNA; 1,2,3,4,5,6 for dice), real numbers (weather measurement
over time: temperature) or vectors of either or the combination
thereof (weather again: temperature, pressure, precipitation).

@note
We always speak of emissions, emission sequences and so forth
when we refer to the sequence of symbols. Other common names
for the same objects are observation and observation sequence.

A simple model with one fair and one loaded coin can be created as follows:

>>> import ghmm
>>> fair = [0.5, 0.5]
>>> loaded = [0.9, 0.1]
>>> A = [[0.9, 0.1], [0.3, 0.7]]
>>> pi = [0.9, 0.1]
>>> B = [fair, loaded]
>>> sigma = ghmm.IntegerRange(0,2)
>>> m = ghmm.HMMFromMatrices(sigma, ghmm.DiscreteDistribution(sigma), A, B, pi)
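
To make the roles of A, B and pi concrete, the following pure-Python sketch
evaluates the likelihood of a short coin sequence under the matrices above
using the forward recursion. It does not call ghmm at all; the name
forward_likelihood is illustrative only and is not part of this module, and
the sketch assumes a non-empty observation sequence.

```python
def forward_likelihood(A, B, pi, observations):
    # alpha[i] = P(observations seen so far, current state == i)
    n = len(pi)
    alpha = [pi[i] * B[i][observations[0]] for i in range(n)]
    for symbol in observations[1:]:
        # propagate through the transition matrix, then weight by the emission
        alpha = [
            sum(alpha[j] * A[j][i] for j in range(n)) * B[i][symbol]
            for i in range(n)
        ]
    return sum(alpha)


fair = [0.5, 0.5]
loaded = [0.9, 0.1]
A = [[0.9, 0.1], [0.3, 0.7]]
pi = [0.9, 0.1]
B = [fair, loaded]

# Sanity check: the likelihoods of all sequences of a fixed length sum to one.
total = sum(forward_likelihood(A, B, pi, [x, y]) for x in (0, 1) for y in (0, 1))
```

A, B and pi carry the same meaning here as in the HMMFromMatrices call above:
row i of A holds the transition probabilities out of state i, row i of B the
emission probabilities of state i, and pi the initial state distribution.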

The objects one has to deal with in HMM modelling are the following

-# The domain the emissions come from: the EmissionDomain. Domain
   is to be understood mathematically and to encompass both discrete,
   finite alphabets and fields such as the real numbers or intervals
   of the reals.\n
   For technical reasons there can be two representations of an
   emission symbol: an external and an internal. The external
   representation is the view of the application using ghmm.py. The
   internal one is what is used in both ghmm.py and the ghmm
   C-library. Representations can coincide, but this is not
   guaranteed. Discrete alphabets of size k are represented as
   [0,1,2,...,k-1] internally.  It is the domain object's job to
   provide a mapping between representations in both directions.
   @note
   Do not make assumptions about the internal
   representations. It might change.

-# Every domain has to provide a distribution, which is usually
   parameterized. A distribution associated with a domain
   should allow us to compute \f$Prob[x| distribution parameters]\f$
   efficiently.\n
   The distribution defines the \b type of distribution which
   we will use to model emissions in <b>every state</b> of the HMM.
   The \b type of distribution is identical for all states; only
   the \b parameterizations differ from state to state.

-# We will consider a Sequence of emissions from the same emission
   domain and very often sets of such sequences: SequenceSet

-# The HMM: The HMM consists of two major components: A Markov chain
   over states (implemented as a weighted directed graph with
   adjacency and inverse-adjacency lists) and the emission
   distributions per-state. For reasons of efficiency the HMM itself
   is *static*, as far as the topology of the underlying Markov chain
   (and obviously the EmissionDomain) are concerned. You cannot add or
   delete transitions in an HMM.\n
   Transition probabilities and the parameters of the per-state
   emission distributions can be easily modified. Particularly,
   Baum-Welch reestimation is supported.  While a transition cannot be
   deleted from the graph, you can set the transition probability to
   zero, which has the same effect from the theoretical point of
   view. However, the corresponding edge in the graph is still
   traversed in the computation.\n
   States in HMMs are referred to by their integer index. State sequences
   are simply lists of integers.\n
   If you want to store application specific data for each state you
   have to do it yourself.\n
   Subclasses of HMM implement specific types of HMM. The type depends
   on the EmissionDomain, the Distribution used, the specific
   extensions to the 'standard' HMMs and so forth.
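
The external and internal representations described for the EmissionDomain
can be illustrated without the C library. In the sketch below the class
name SketchAlphabet and its method names are hypothetical; the class only
mirrors the idea of a bidirectional mapping between application symbols and
the integer codes 0..k-1 and is not the Alphabet class defined in this module.

```python
class SketchAlphabet:
    # Maps external symbols to the internal codes 0..k-1 and back.
    def __init__(self, symbols):
        self.symbols = list(symbols)
        self.index = {s: i for i, s in enumerate(self.symbols)}

    def internal_sequence(self, seq):
        # an unknown symbol raises KeyError from the dict lookup
        return [self.index[s] for s in seq]

    def external_sequence(self, codes):
        return [self.symbols[c] for c in codes]


dna = SketchAlphabet("acgt")
codes = dna.internal_sequence("gattaca")
round_trip = "".join(dna.external_sequence(codes))
```

Application code would normally only ever see the external symbols; the
integer codes are what gets handed to the underlying library, which is why
no assumptions should be made about them.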
"""

import ghmmwrapper
import ghmmhelper
import modhmmer
import re
import io
import copy
import math
import sys
import os
import logging
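# string.join() was removed in Python 3; keep a module-level equivalent
# so that the existing join(words, sep) calls continue to work.
def join(words, sep=' '):
    return sep.join(words)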
from textwrap import fill

# Initialize logging to stderr
#logging.basicConfig(format="%(asctime)s %(filename)s:%(lineno)d %(levelname)-5s - %(message)s")
log = logging.getLogger("GHMM")

# creating StreamHandler to stderr
hdlr = logging.StreamHandler(sys.stderr)

# setting message format
#fmt = logging.Formatter("%(name)s %(asctime)s %(filename)s:%(lineno)d %(levelname)s %(thread)-5s - %(message)s")
fmt = logging.Formatter("%(name)s %(filename)s:%(lineno)d - %(message)s")
hdlr.setFormatter(fmt)

# adding handler to logger object
log.addHandler(hdlr)

# Set the minimal severity of a message to be shown. The levels in
# increasing severity are: DEBUG, INFO, WARNING, ERROR, CRITICAL

log.setLevel(logging.WARNING)
log.info( " I'm the ghmm in "+ __file__)

c_log = [log.critical, log.error, log.warning, log.info, log.debug]
def logwrapper(level, message):
    c_log[level](message)

ghmmwrapper.set_pylogging(logwrapper)

# Initialize global random number generator by system time
ghmmwrapper.ghmm_rng_init()
ghmmwrapper.time_seed()


#-------------------------------------------------------------------------------
#- Exceptions ------------------------------------------------------------------

class GHMMError(Exception):
    """Base class for exceptions in this module."""
    def __init__(self, message):
        self.message = message
    def __str__(self):
        return repr(self.message)

class UnknownInputType(GHMMError):
    def __init__(self,message):
        self.message = message
    def __str__(self):
        return repr(self.message)


class NoValidCDataType(GHMMError):
    def __init__(self,message):
        self.message = message
    def __str__(self):
        return repr(self.message)


class badCPointer(GHMMError):
    def __init__(self,message):
        self.message = message
    def __str__(self):
        return repr(self.message)


class SequenceCannotBeBuild(GHMMError):
    def __init__(self,message):
        self.message = message
    def __str__(self):
        return repr(self.message)

class InvalidModelParameters(GHMMError):
    def __init__(self,message):
        self.message = message
    def __str__(self):
        return repr(self.message)

class GHMMOutOfDomain(GHMMError):
    def __init__(self,message):
        self.message = message
    def __str__(self):
        return repr(self.message)

class UnsupportedFeature(GHMMError):
    def __init__(self,message):
        self.message = message
    def __str__(self):
        return repr(self.message)

class WrongFileType(GHMMError):
    def __init__(self,message):
        self.message = message
    def __str__(self):
        return repr(self.message)

class ParseFileError(GHMMError):
    def __init__(self,message):
        self.message = message
    def __str__(self):
        return repr(self.message)

#-------------------------------------------------------------------------------
#- constants -------------------------------------------------------------------
kNotSpecified            = ghmmwrapper.kNotSpecified
kLeftRight               = ghmmwrapper.kLeftRight
kSilentStates            = ghmmwrapper.kSilentStates
kTiedEmissions           = ghmmwrapper.kTiedEmissions
kHigherOrderEmissions    = ghmmwrapper.kHigherOrderEmissions
kBackgroundDistributions = ghmmwrapper.kBackgroundDistributions
kLabeledStates           = ghmmwrapper.kLabeledStates
kTransitionClasses       = ghmmwrapper.kTransitionClasses
kDiscreteHMM             = ghmmwrapper.kDiscreteHMM
kContinuousHMM           = ghmmwrapper.kContinuousHMM
kPairHMM                 = ghmmwrapper.kPairHMM
types = {
    kLeftRight:'kLeftRight',
    kSilentStates:'kSilentStates',
    kTiedEmissions:'kTiedEmissions',
    kHigherOrderEmissions:'kHigherOrderEmissions',
    kBackgroundDistributions:'kBackgroundDistributions',
    kLabeledStates:'kLabeledStates',
    kTransitionClasses:'kTransitionClasses',
    kDiscreteHMM:'kDiscreteHMM',
    kContinuousHMM:'kContinuousHMM',
    kPairHMM:'kPairHMM',
    }
#-------------------------------------------------------------------------------
#- EmissionDomain and derived  -------------------------------------------------
class EmissionDomain(object):
    """ Abstract base class for emissions produced by an HMM.

    There can be two representations for emissions:
    -# An internal, used in ghmm.py and the ghmm C-library
    -# An external, used in your particular application

    Example:\n
    The underlying library represents symbols from a finite,
    discrete domain as integers (see Alphabet).

    EmissionDomain is the identity mapping
    """

    def internal(self, emission):
        """ Given an emission return the internal representation
        """
        return emission


    def internalSequence(self, emissionSequence):
        """ Given an emission sequence return the internal representation
        """
        return emissionSequence


    def external(self, internal):
        """ Given an internal representation return the external representation
        """
        return internal

    def externalSequence(self, internalSequence):
        """ Given a sequence with the internal representation return the external
        representation
        """
        return internalSequence


    def isAdmissable(self, emission):
        """ Check whether emission is admissible (contained in) the domain.

        The base class performs no check and returns None; subclasses
        return a boolean.
        """
        return None


class Alphabet(EmissionDomain):
    """ Discrete, finite alphabet

    """
    def __init__(self, listOfCharacters):
        """
        Creates an alphabet out of a listOfCharacters
        @param listOfCharacters a list of strings (single characters most of
        the time), ints, or other objects that can be used as dictionary keys
        for a mapping of the external sequences to the internal representation
        or can alternatively be a SWIG pointer to a
        C alphabet_s struct

        @note
        Alphabets should be considered as immutable. That means the
        listOfCharacters and the mapping should never be touched after
        construction.
        """
        self.index = {} # Which index belongs to which character

        if type(listOfCharacters) is ghmmwrapper.ghmm_alphabet:
            self.listOfCharacters = [listOfCharacters.getSymbol(i) for 
                    i in range(listOfCharacters.size)]
        else:
            self.listOfCharacters = listOfCharacters

        for i,c in enumerate(self.listOfCharacters):
            self.index[c] = i

        lens = {}
        try:
            for c in self.listOfCharacters:
                lens[len(c)] = 1
        except TypeError:
            self._lengthOfCharacters = None
        else:
            if len(lens) == 1:
                self._lengthOfCharacters = list(lens.keys())[0]
            else:
                self._lengthOfCharacters = None

        self.CDataType = "int" # flag indicating which C data type should be used


    def __str__(self):
        strout = ["<Alphabet:"]
        strout.append( str(self.listOfCharacters) +'>')

        return join(strout,'')

    def verboseStr(self):
        strout = ["GHMM Alphabet:\n"]
        strout.append("Number of symbols: " + str(len(self)) + "\n")
        strout.append("External: " + str(self.listOfCharacters) + "\n")
        strout.append("Internal: " + str(list(range(len(self)))) + "\n")
        return join(strout,'')


    def __eq__(self,alph):
        if not isinstance(alph,Alphabet):
            return False
        else:
            if self.listOfCharacters == alph.listOfCharacters and self.index == alph.index and self.CDataType==alph.CDataType:
                return True
            else:
                return False

    def __len__(self):
        return len(self.listOfCharacters)

    def __hash__(self):
        #XXX rewrite
        # defining hash and eq is not recommended for mutable types.
        # => listOfCharacters should be considered immutable
        return id(self)

    #obsolete
    def size(self):
        """ @deprecated use len() instead
        """
        log.warning( "Warning: The use of .size() is deprecated. Use len() instead.")
        return len(self.listOfCharacters)


    def internal(self, emission):
        """ Given an emission return the internal representation
        """
        return self.index[emission]


    def internalSequence(self, emissionSequence):
        """ Given an emission sequence return the internal representation

        Raises KeyError
        """
        # self.index is a dict, so an unknown symbol raises KeyError directly
        return [self.index[i] for i in emissionSequence]


    def external(self, internal):
        """ Given an internal representation return the external representation

        @note the internal code -1 always represents a gap character '-'

        Raises KeyError
        """
        if internal == -1:
            return "-"
        if internal < -1 or internal >= len(self.listOfCharacters):
            raise KeyError("Internal symbol "+str(internal)+" not recognized.")
        return self.listOfCharacters[internal]

    def externalSequence(self, internalSequence):
        """ Given a sequence with the internal representation return the external
        representation

        Raises KeyError
        """
        result = copy.deepcopy(internalSequence)
        try:
            result = [self.listOfCharacters[i] for i in result]
        except IndexError:
            raise KeyError
        return result

    def isAdmissable(self, emission):
        """ Check whether emission is admissible (contained in) the domain
        """
        return emission in self.listOfCharacters

    def getExternalCharacterLength(self):
        """
        If all external characters are of the same length the length is
        returned. Otherwise None.
        @return length of the external characters or None
        """
        return self._lengthOfCharacters

    def toCstruct(self):
        calphabet = ghmmwrapper.ghmm_alphabet(len(self), "<unused>")
        for i,symbol in enumerate(self.listOfCharacters):
            calphabet.setSymbol(i, str(symbol))

        return calphabet


DNA = Alphabet(['a','c','g','t'])
AminoAcids = Alphabet(['A','C','D','E','F','G','H','I','K','L',
                       'M','N','P','Q','R','S','T','V','W','Y'])
def IntegerRange(a,b):
    """
    Creates an Alphabet with internal and external representation of range(a,b)
    @return Alphabet
    """
    return Alphabet(list(range(a,b)))


# To be used for labelled HMMs. We could use an Alphabet directly but this way it is more explicit.
class LabelDomain(Alphabet):
    def __init__(self, listOfLabels):
        Alphabet.__init__(self, listOfLabels)


class Float(EmissionDomain):
    """Continuous Alphabet"""

    def __init__(self):
        self.CDataType = "double" # flag indicating which C data type should be used

    def __eq__(self, other):
        return isinstance(other, Float)

    def __hash__(self):
        # defining hash and eq is not recommended for mutable types.
        # for float it is fine because it is kind of state less
        return id(self)

    def isAdmissable(self, emission):
        """ Check whether emission is admissible (contained in) the domain

        @return True if emission is a float, False otherwise
        """
        return isinstance(emission,float)



#-------------------------------------------------------------------------------
#- Distribution and derived  ---------------------------------------------------
class Distribution(object):
    """ Abstract base class for distribution over EmissionDomains
    """

    # add density, mass, cumulative distribution, quantiles, sample,
    # fit parameters, moments


class DiscreteDistribution(Distribution):
    """ A DiscreteDistribution over an Alphabet: The discrete distribution
    is parameterized by the vectors of probabilities.

    """
    def __init__(self, alphabet):
        self.alphabet = alphabet
        self.prob_vector = None

    def set(self, prob_vector):
        self.prob_vector = prob_vector

    def get(self):
        return self.prob_vector


class ContinuousDistribution(Distribution):
    pass

class UniformDistribution(ContinuousDistribution):
    def __init__(self, domain):
        self.emissionDomain = domain
        self.max = None
        self.min = None

    def set(self, values):
        """
        @param values tuple of maximum, minimum
        """
        maximum, minimum = values
        self.max = maximum
        self.min = minimum

    def get(self):
        return (self.max, self.min)

class GaussianDistribution(ContinuousDistribution):
    # XXX attributes unused at this point
    def __init__(self, domain):
        self.emissionDomain = domain
        self.mu = None
        self.sigma = None

    def set(self, values):
        """
        @param values tuple of mu, sigma
        """
        mu, sigma = values
        self.mu = mu
        self.sigma = sigma

    def get(self):
        return (self.mu, self.sigma)

class TruncGaussianDistribution(GaussianDistribution):
    # XXX attributes unused at this point
    def __init__(self, domain):
        GaussianDistribution.__init__(self, domain)
        self.trunc = None

    def set(self, values):
        """
        @param values tuple of mu, sigma, trunc
        """
        mu, sigma, trunc = values
        self.mu = mu
        self.sigma = sigma
        self.trunc = trunc

    def get(self):
        return (self.mu, self.sigma, self.trunc)

class GaussianMixtureDistribution(ContinuousDistribution):
    # XXX attributes unused at this point
    def __init__(self, domain):
        self.emissionDomain = domain
        self.M = None   # number of mixture components
        self.mu = []
        self.sigma = []
        self.weight = []

    def set(self, index, values):
        """
        @param index index of mixture component
        @param values tuple of mu, sigma, w
        """
        mu, sigma, w = values
        pass

    def get(self):
        pass

class ContinuousMixtureDistribution(ContinuousDistribution):
    def __init__(self, domain):
        self.emissionDomain = domain
        self.M = 0   # number of mixture components
        self.components = []
        self.weight = []
        self.fix = []

    def add(self,w,fix,distribution):
        assert isinstance(distribution,ContinuousDistribution)
        self.M = self.M + 1
        self.weight.append(w)
        self.components.append(distribution)
        if isinstance(distribution,UniformDistribution):
            # uniform distributions are fixed by definition
            self.fix.append(1)
        else:
            self.fix.append(fix)

    def set(self, index, w, fix, distribution):
        if index >= self.M:
            raise IndexError

        assert isinstance(distribution,ContinuousDistribution)
        self.weight[index] = w
        self.components[index] = distribution
        if isinstance(distribution,UniformDistribution):
            # uniform distributions are fixed by definition
            self.fix[index] = 1
        else:
            self.fix[index] = fix

    def get(self,i):
        assert self.M > i
        return (self.weight[i],self.fix[i],self.components[i])

    def check(self):
        assert self.M == len(self.components)
        # weights are floats, so compare the sum with a small tolerance
        assert abs(sum(self.weight) - 1.0) < 1e-9
        assert all(0 <= w <= 1 for w in self.weight)


class MultivariateGaussianDistribution(ContinuousDistribution):
    def __init__(self, domain):
        self.emissionDomain = domain


#-------------------------------------------------------------------------------
#Sequence, SequenceSet and derived  ------------------------------------------

class EmissionSequence(object):
    """ An EmissionSequence contains the *internal* representation of
    a sequence of emissions.

    It also contains a reference to the domain where the emissions originated from.
    """

    def __init__(self, emissionDomain, sequenceInput, labelDomain = None, labelInput = None, ParentSequenceSet=None):

        self.emissionDomain = emissionDomain

        if ParentSequenceSet is not None:
            # optional reference to a parent SequenceSet. Is needed for reference counting
            if not isinstance(ParentSequenceSet,SequenceSet):
                raise TypeError("Invalid reference. Only SequenceSet is valid.")
        self.ParentSequenceSet = ParentSequenceSet

        if self.emissionDomain.CDataType == "int":
            # necessary C functions for accessing the ghmm_dseq struct
            self.sequenceAllocationFunction = ghmmwrapper.ghmm_dseq
            self.allocSingleSeq = ghmmwrapper.int_array_alloc
            #obsolete
            if ghmmwrapper.ASCI_SEQ_FILE:
                self.seq_read = ghmmwrapper.ghmm_dseq_read
            self.seq_ptr_array_getitem = ghmmwrapper.dseq_ptr_array_getitem
            self.sequence_carray = ghmmwrapper.list2int_array
        elif self.emissionDomain.CDataType == "double":
            # necessary C functions for accessing the ghmm_cseq struct
            self.sequenceAllocationFunction = ghmmwrapper.ghmm_cseq
            self.allocSingleSeq = ghmmwrapper.double_array_alloc
            #obsolete
            if ghmmwrapper.ASCI_SEQ_FILE:
                self.seq_read = ghmmwrapper.ghmm_cseq_read
            self.seq_ptr_array_getitem = ghmmwrapper.cseq_ptr_array_getitem
            self.sequence_carray = ghmmwrapper.list2double_array
        else:
            raise NoValidCDataType("C data type " + str(self.emissionDomain.CDataType) + " invalid.")


        # check if ghmm is built with ASCII sequence file support
        if isinstance(sequenceInput, str):
            if ghmmwrapper.ASCI_SEQ_FILE:
                if  not os.path.exists(sequenceInput):
                    raise IOError('File ' + str(sequenceInput) + ' not found.')
                else:
                    tmp = self.seq_read(sequenceInput)
                    if len(tmp) > 0:
                        self.cseq = tmp[0]
                    else:
                        raise ParseFileError('File ' + str(sequenceInput) + ' not valid.')

            else:
                raise UnsupportedFeature("ASCII sequence files are deprecated. Please convert your files"
                                       + " to the new xml-format or rebuild the GHMM with"
                                       + " the conditional \"GHMM_OBSOLETE\".")

        #create a ghmm_dseq with state_labels, if the appropriate parameters are set
        elif isinstance(sequenceInput, list):
            internalInput = self.emissionDomain.internalSequence(sequenceInput)
            seq = self.sequence_carray(internalInput)
            self.cseq = self.sequenceAllocationFunction(seq, len(sequenceInput))

            if labelInput is not None and labelDomain is not None:
                assert len(sequenceInput)==len(labelInput), "Length of the sequence and labels don't match."
                assert isinstance(labelInput, list), "expected a list of labels."
                assert isinstance(labelDomain, LabelDomain), "labelDomain is not a LabelDomain class."

                self.labelDomain = labelDomain

                #translate the external labels in internal
                internalLabel = self.labelDomain.internalSequence(labelInput)
                label = ghmmwrapper.list2int_array(internalLabel)
                self.cseq.init_labels(label, len(internalInput))

        # internal use
        elif isinstance(sequenceInput, ghmmwrapper.ghmm_dseq) or isinstance(sequenceInput, ghmmwrapper.ghmm_cseq):
            if sequenceInput.seq_number > 1:
                raise badCPointer("Use SequenceSet for multiple sequences.")
            self.cseq = sequenceInput
            if labelDomain is not None:
                self.labelDomain = labelDomain

        else:
            raise UnknownInputType("inputType " + str(type(sequenceInput)) + " not recognized.")

    def __del__(self):
        "Deallocation of C sequence struct."
        log.debug( "__del__ EmissionSequence " + str(self.cseq))
        # if a parent SequenceSet exists, we use cseq.subseq_free() to free memory
        if self.ParentSequenceSet is not None:
            self.cseq.subseq_free()
        self.ParentSequenceSet = None


    def __len__(self):
        "Returns the length of the sequence."
        return self.cseq.getLength(0)


    def __setitem__(self, index, value):
        internalValue = self.emissionDomain.internal(value)
        self.cseq.setSymbol(0, index, internalValue)


    def __getitem__(self, index):
        """
        @returns the symbol at position 'index'.
        """
        if index < len(self):
            return self.cseq.getSymbol(0, index)
        else:
            raise IndexError

    def getSeqLabel(self):
        if not ghmmwrapper.SEQ_LABEL_FIELD:
            raise UnsupportedFeature("the seq_label field is obsolete. If you need it rebuild the GHMM with the conditional \"GHMM_OBSOLETE\".")
        return ghmmwrapper.long_array_getitem(self.cseq.seq_label,0)

    def setSeqLabel(self,value):
        if not ghmmwrapper.SEQ_LABEL_FIELD:
            raise UnsupportedFeature("the seq_label field is obsolete. If you need it rebuild the GHMM with the conditional \"GHMM_OBSOLETE\".")
        ghmmwrapper.long_array_setitem(self.cseq.seq_label,0,value)

    def getStateLabel(self):
        """
        @returns the labeling of the sequence in external representation
        """
        if self.cseq.state_labels != None:
            iLabel = ghmmwrapper.int_array2list(self.cseq.getLabels(0), self.cseq.getLabelsLength(0))
            return self.labelDomain.externalSequence(iLabel)
        else:
            raise IndexError(str(0) + " is out of bounds, only " + str(self.cseq.seq_number) + " labels")

    def hasStateLabels(self):
        """
        @returns whether the sequence is labeled or not
        """
        return self.cseq.state_labels != None

    def getGeneratingStates(self):
        """
        @returns the state path from which the sequence was generated as
        a Python list.
        """
        l_state = []
        for j in range(ghmmwrapper.int_array_getitem(self.cseq.states_len,0) ):
            l_state.append(ghmmwrapper.int_matrix_getitem(self.cseq.states,0,j))

        return l_state

    def __str__(self):
        "Defines string representation."
        seq = self.cseq
        strout = []

        l = seq.getLength(0)
        if l <= 80:

            for j in range(l):
                strout.append(str( self.emissionDomain.external(self[j]) )   )
                if self.emissionDomain.CDataType == "double":
                    strout.append(" ")
        else:

            for j in range(0,5):
                strout.append(str( self.emissionDomain.external(self[j]) )   )
                if self.emissionDomain.CDataType == "double":
                    strout.append(" ")
            strout.append('...')
            for j in range(l-5,l):
                strout.append(str( self.emissionDomain.external(self[j]) )   )
                if self.emissionDomain.CDataType == "double":
                    strout.append(" ")

        return join(strout,'')

    def verboseStr(self):
        "Defines string representation."
        seq = self.cseq
        strout = []
        strout.append("\nEmissionSequence Instance:\nlength " + str(seq.getLength(0)))
        strout.append(", weight " + str(seq.getWeight(0))  + ":\n")
        for j in range(seq.getLength(0)):
            strout.append(str( self.emissionDomain.external(self[j]) )   )
            if self.emissionDomain.CDataType == "double":
                strout.append(" ")

        # checking for labels
        if self.emissionDomain.CDataType == "int" and self.cseq.state_labels != None:
            strout.append("\nState labels:\n")
            for j in range(seq.getLabelsLength(0)):
                strout.append(str( self.labelDomain.external(ghmmwrapper.int_matrix_getitem(seq.state_labels,0,j)))+ ", ")

        return join(strout,'')


    def sequenceSet(self):
        """
        @return a one-element SequenceSet with this sequence.
        """

        # in order to copy the sequence in 'self', we first create an empty SequenceSet and then
        # add 'self'
        seqSet = SequenceSet(self.emissionDomain, [])
        seqSet.cseq.add(self.cseq)
        return seqSet

    def write(self,fileName):
        "Writes the EmissionSequence into file 'fileName'."
        self.cseq.write(fileName)

    def setWeight(self, value):
        self.cseq.setWeight(0, value)
        self.cseq.total_w  = value

    def getWeight(self):
        return self.cseq.getWeight(0)

    def asSequenceSet(self):
        """
        @returns this EmissionSequence as a one element SequenceSet
        """
        log.debug("EmissionSequence.asSequenceSet() -- begin " + repr(self.cseq))
        seq = self.sequenceAllocationFunction(1)

        # checking for state labels in the source C sequence struct
        if self.emissionDomain.CDataType == "int" and self.cseq.state_labels is not None:
            log.debug("EmissionSequence.asSequenceSet() -- found labels !")
            seq.calloc_state_labels()
            self.cseq.copyStateLabel(0, seq, 0)

        seq.setLength(0, self.cseq.getLength(0))
        seq.setSequence(0, self.cseq.getSequence(0))
        seq.setWeight(0, self.cseq.getWeight(0))

        log.debug("EmissionSequence.asSequenceSet() -- end " + repr(seq))
        return SequenceSetSubset(self.emissionDomain, seq, self)


class SequenceSet(object):
    """ A SequenceSet contains the *internal* representation of a number of
    sequences of emissions.

    It also contains a reference to the domain where the emissions originated from.
    """

    def __init__(self, emissionDomain, sequenceSetInput, labelDomain = None, labelInput = None):
        """
        @p sequenceSetInput is a set of sequences from @p emissionDomain.

        There are several valid types for @p sequenceSetInput:
        - if @p sequenceSetInput is a string, it is interpreted as the filename
          of a sequence file to be read. File format should be fasta.
        - if @p sequenceSetInput is a list, it is considered as a list of lists
          containing the input sequences
        - @p sequenceSetInput can also be a pointer to a C sequence struct but
          this is only meant for internal use

        """
        self.emissionDomain = emissionDomain
        self.cseq = None

        if self.emissionDomain.CDataType == "int":
            # necessary C functions for accessing the ghmm_dseq struct
            self.sequenceAllocationFunction = ghmmwrapper.ghmm_dseq
            self.allocSingleSeq = ghmmwrapper.int_array_alloc
            #obsolete
            if ghmmwrapper.ASCI_SEQ_FILE:
                self.seq_read = ghmmwrapper.ghmm_dseq_read
            self.seq_ptr_array_getitem = ghmmwrapper.dseq_ptr_array_getitem
            self.sequence_cmatrix = ghmmhelper.list2int_matrix
        elif self.emissionDomain.CDataType == "double":
            # necessary C functions for accessing the ghmm_cseq struct
            self.sequenceAllocationFunction = ghmmwrapper.ghmm_cseq
            self.allocSingleSeq = ghmmwrapper.double_array_alloc
            #obsolete
            if ghmmwrapper.ASCI_SEQ_FILE:
                self.seq_read = ghmmwrapper.ghmm_cseq_read
            self.seq_ptr_array_getitem = ghmmwrapper.cseq_ptr_array_getitem
            self.sequence_cmatrix = ghmmhelper.list2double_matrix
        else:
            raise NoValidCDataType("C data type " + str(self.emissionDomain.CDataType) + " invalid.")


        # reads in the first sequence struct in the input file
        if isinstance(sequenceSetInput, str):
            if sequenceSetInput[-3:] == ".fa" or sequenceSetInput[-6:] == ".fasta":
                # assuming FastA file:
                alfa = emissionDomain.toCstruct()
                cseq = ghmmwrapper.ghmm_dseq(sequenceSetInput, alfa)
                if cseq is None:
                    raise ParseFileError("invalid FastA file: " + sequenceSetInput)
                self.cseq = cseq
            # check if ghmm is built with asci sequence file support
            elif not ghmmwrapper.ASCI_SEQ_FILE:
                raise UnsupportedFeature("asci sequence files are deprecated. "
                                         "Please convert your files to the new xml-format or rebuild the GHMM "
                                         "with the conditional \"GHMM_OBSOLETE\".")
            else:
                if not os.path.exists(sequenceSetInput):
                    raise IOError('File ' + str(sequenceSetInput) + ' not found.')
                else:
                    tmp = self.seq_read(sequenceSetInput)
                    if len(tmp) > 0:
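                        # wrap with the constructor matching the emission domain
                        # (ghmm_dseq for "int", ghmm_cseq for "double")
                        self.cseq = self.sequenceAllocationFunction(tmp[0])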
                    else:
                        raise ParseFileError('File ' + str(sequenceSetInput) + ' not valid.')

        elif isinstance(sequenceSetInput, list):
            internalInput = [self.emissionDomain.internalSequence(seq) for seq in sequenceSetInput]
            (seq, lengths) = self.sequence_cmatrix(internalInput)
            lens = ghmmwrapper.list2int_array(lengths)

            self.cseq = self.sequenceAllocationFunction(seq, lens, len(sequenceSetInput))

            if isinstance(labelInput, list) and isinstance(labelDomain, LabelDomain):
                assert len(sequenceSetInput)==len(labelInput), "no. of sequences and labels do not match."

                self.labelDomain = labelDomain
                internalLabels = [self.labelDomain.internalSequence(oneLabel) for oneLabel in labelInput]
                (label,labellen) = ghmmhelper.list2int_matrix(internalLabels)
                lens = ghmmwrapper.list2int_array(labellen)
                self.cseq.init_labels(label, lens)

        #internal use
        elif isinstance(sequenceSetInput, ghmmwrapper.ghmm_dseq) or isinstance(sequenceSetInput, ghmmwrapper.ghmm_cseq):
            log.debug("SequenceSet.__init__()" + str(sequenceSetInput))
            self.cseq = sequenceSetInput
            if labelDomain is not None:
                self.labelDomain = labelDomain

        else:
            raise UnknownInputType("inputType " + str(type(sequenceSetInput)) + " not recognized.")


    def __del__(self):
        "Deallocation of C sequence struct."
        log.debug( "__del__ SequenceSet " + str(self.cseq))


    def __str__(self):
        "Defines string representation."
        seq = self.cseq
        strout =  ["SequenceSet (N=" + str(seq.seq_number)+")"]


        if seq.seq_number <= 6:
            iter_list = list(range(seq.seq_number))
        else:
            iter_list = [0,1,'X',seq.seq_number-2,seq.seq_number-1]


        for i in iter_list:
            if i == 'X':
                strout.append('\n\n   ...\n')
            else:
                strout.append("\n  seq " + str(i)+ "(len=" + str(seq.getLength(i)) + ")\n")
                strout.append('    '+str(self[i]))


        return ''.join(strout)


    def verboseStr(self):
        "Returns a verbose string representation including weights and state labels."
        seq = self.cseq
        strout =  ["\nNumber of sequences: " + str(seq.seq_number)]

        for i in range(seq.seq_number):
            strout.append("\nSeq " + str(i)+ ", length " + str(seq.getLength(i)))
            strout.append(", weight " + str(seq.getWeight(i))  + ":\n")
            for j in range(seq.getLength(i)):
                if self.emissionDomain.CDataType == "int":
                    strout.append(str( self.emissionDomain.external(( ghmmwrapper.int_matrix_getitem(self.cseq.seq, i, j) )) ))
                elif self.emissionDomain.CDataType == "double":
                    strout.append(str( self.emissionDomain.external(( ghmmwrapper.double_matrix_getitem(self.cseq.seq, i, j) )) ) + " ")

            # checking for labels
            if self.emissionDomain.CDataType == "int" and self.cseq.state_labels is not None:
                strout.append("\nState labels:\n")
                for j in range(seq.getLabelsLength(i)):
                    strout.append(str( self.labelDomain.external(ghmmwrapper.int_matrix_getitem(seq.state_labels,i,j))) +", ")

        return ''.join(strout)


    def __len__(self):
        """
        @returns the number of sequences in the SequenceSet.
        """
        return self.cseq.seq_number

    def sequenceLength(self, i):
        """
        @returns the length of sequence 'i' in the SequenceSet
        """
        return self.cseq.getLength(i)

    def getWeight(self, i):
        """
        @returns the weight of sequence i. @note Weights are used in Baum-Welch
        """
        return self.cseq.getWeight(i)

    def setWeight(self, i, w):
        """
        Set the weight of sequence i. @note Weights are used in Baum-Welch
        """
        ghmmwrapper.double_array_setitem(self.cseq.seq_w, i, w)

    def __getitem__(self, index):
        """
        @returns an EmissionSequence object initialized with a reference to
        sequence 'index'.
        """
        # check the index for correct range
        if index >= self.cseq.seq_number:
            raise IndexError

        seq = self.cseq.get_singlesequence(index)
        return EmissionSequence(self.emissionDomain, seq, ParentSequenceSet=self)


    def getSeqLabel(self,index):
        if not ghmmwrapper.SEQ_LABEL_FIELD:
            raise UnsupportedFeature("the seq_label field is obsolete. If you need it rebuild the GHMM with the conditional \"GHMM_OBSOLETE\".")
        return ghmmwrapper.long_array_getitem(self.cseq.seq_label,index)

    def setSeqLabel(self,index,value):
        if not ghmmwrapper.SEQ_LABEL_FIELD:
            raise UnsupportedFeature("the seq_label field is obsolete. If you need it rebuild the GHMM with the conditional \"GHMM_OBSOLETE\".")
        ghmmwrapper.long_array_setitem(self.cseq.seq_label,index,value)

    def getGeneratingStates(self):
        """
        @returns the state paths from which the sequences were generated as a
        Python list of lists.
        """
        states_len = ghmmwrapper.int_array2list(self.cseq.states_len, len(self))
        l_state = []
        for i, length in enumerate(states_len):
            col = ghmmwrapper.int_matrix_get_col(self.cseq.states, i)
            l_state.append(ghmmwrapper.int_array2list(col, length))

        return l_state


    def getSequence(self, index):
        """
        @returns the index-th sequence in internal representation
        """
        seq = []
        if self.cseq.seq_number > index:
            for j in range(self.cseq.getLength(index)):
                seq.append(self.cseq.getSymbol(index, j))
            return seq
        else:
            raise IndexError(str(index) + " is out of bounds, only " + str(self.cseq.seq_number) + " sequences")

    def getStateLabel(self,index):
        """
        @returns the labeling of the index-th sequence in internal representation
        """
        label = []
        if self.cseq.seq_number > index and self.cseq.state_labels != None:
            for j in range(self.cseq.getLabelsLength(index)):
                label.append(self.labelDomain.external(ghmmwrapper.int_matrix_getitem(self.cseq.state_labels, index, j)))
            return label
        else:
            raise IndexError(str(index) + " is out of bounds, only " + str(self.cseq.seq_number) + " labels")

    def hasStateLabels(self):
        """
        @returns whether the sequence is labeled or not
        """
        return self.cseq.state_labels != None


    def merge(self, emissionSequences): # Only allow EmissionSequence?
        """
        Merges 'emissionSequences' into 'self'.
        @param emissionSequences can either be an EmissionSequence or SequenceSet
        object.
        """

        if not isinstance(emissionSequences,EmissionSequence) and not isinstance(emissionSequences,SequenceSet):
            raise TypeError("EmissionSequence or SequenceSet required, got " + str(emissionSequences.__class__.__name__))

        self.cseq.add(emissionSequences.cseq)
        del(emissionSequences) # drops only the local reference to the merged sequences

    def getSubset(self, seqIndixes):
        """
        @returns a SequenceSet containing (references to) the sequences with the
        indices in 'seqIndixes'.
        """
        seqNumber = len(seqIndixes)
        seq = self.sequenceAllocationFunction(seqNumber)

        # checking for state labels in the source C sequence struct
        if self.emissionDomain.CDataType == "int" and self.cseq.state_labels is not None:

            log.debug( "SequenceSet: found labels !")
            seq.calloc_state_labels()

        for i,seq_nr in enumerate(seqIndixes):
            len_i = self.cseq.getLength(seq_nr)
            seq.setSequence(i, self.cseq.getSequence(seq_nr))
            seq.setLength(i, len_i)
            seq.setWeight(i, self.cseq.getWeight(seq_nr))

            # setting labels if appropriate
            if self.emissionDomain.CDataType == "int" and self.cseq.state_labels is not None:
                self.cseq.copyStateLabel(seq_nr, seq, i)

        seq.seq_number = seqNumber

        return SequenceSetSubset(self.emissionDomain, seq, self)

    def write(self,fileName):
        "Writes (appends) the SequenceSet into file 'fileName'."
        self.cseq.write(fileName)

    def asSequenceSet(self):
        """convenience function, returns only self"""
        return self

class SequenceSetSubset(SequenceSet):
    """
    SequenceSetSubset contains a subset of the sequences from a SequenceSet
    object.

    @note On the C side only the references are used.
    """
    def __init__(self, emissionDomain, sequenceSetInput, ParentSequenceSet , labelDomain = None, labelInput = None):
        # reference on the parent SequenceSet object
        log.debug("SequenceSetSubset.__init__ -- begin -" +  str(ParentSequenceSet))
        self.ParentSequenceSet = ParentSequenceSet
        SequenceSet.__init__(self, emissionDomain, sequenceSetInput, labelDomain, labelInput)

    def __del__(self):
        """ Since we do not want to deallocate the sequence memory,
        the destructor has to be overloaded.
        """
        log.debug( "__del__ SequenceSetSubset " + str(self.cseq))

        if self.cseq is not None:
            self.cseq.subseq_free()
            self.cseq.thisown = 0

        # remove reference on parent SequenceSet object
        self.ParentSequenceSet = None



def SequenceSetOpen(emissionDomain, fileName):
    # XXX poorly chosen name
    """ Reads a sequence file with multiple sequence sets.

    @returns a list of SequenceSet objects.

    """
    #checks if supports asci sequence files, deprecated
    if not ghmmwrapper.ASCI_SEQ_FILE:
        raise UnsupportedFeature("asci sequence files are deprecated. Please convert your files"
                                       + " to the new xml-format or rebuild the GHMM with"
                                       + " the conditional \"GHMM_OBSOLETE\".")


    if not os.path.exists(fileName):
        raise IOError('File ' + str(fileName) + ' not found.')

    if emissionDomain.CDataType == "int":
        seq_read_func_ptr = ghmmwrapper.ghmm_dseq_read
        seq_ctor_func_ptr = ghmmwrapper.ghmm_dseq
    elif emissionDomain.CDataType == "double":
        seq_read_func_ptr = ghmmwrapper.ghmm_cseq_read
        seq_ctor_func_ptr = ghmmwrapper.ghmm_cseq
    else:
        raise TypeError("Invalid c data type " + str(emissionDomain.CDataType))

    seqs = seq_read_func_ptr(fileName)
    # ugly workaround for swig bug. swig is not always creating a proxy class
    seqs = [seq_ctor_func_ptr(ptr) for ptr in seqs]
    sequenceSets = [SequenceSet(emissionDomain, seq_ptr) for seq_ptr in seqs]
    return sequenceSets


def writeToFasta(seqSet,fn):
    """
    Writes a SequenceSet into a fasta file.
    """
    if not isinstance(seqSet, SequenceSet):
        raise TypeError("SequenceSet expected.")
    # the with statement closes the file even if writing fails
    with open(fn,'w') as f:
        for i in range(len(seqSet)):
            rseq = []
            for j in range(seqSet.sequenceLength(i)):
                rseq.append(str(seqSet.emissionDomain.external(
                    ghmmwrapper.int_matrix_getitem(seqSet.cseq.seq, i, j)
                    )))

            f.write('>seq'+str(i)+'\n')
            f.write(fill(''.join(rseq)))
            f.write('\n')
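
# Illustrative sketch (not part of the original GHMM API): the record layout
# written by writeToFasta above, restated without any C structures. The helper
# name _fastaTextSketch is hypothetical. It takes sequences that are already
# in external representation (lists of symbols) and returns the file content
# as a string. It assumes that the fill() used above is textwrap.fill with its
# default line width; the '>seq<i>' headers follow writeToFasta.
import textwrap

def _fastaTextSketch(sequences):
    """ Returns FASTA formatted text for a list of symbol lists. """
    records = []
    for i, symbols in enumerate(sequences):
        # header line, wrapped sequence body, terminating newline
        records.append('>seq' + str(i) + '\n')
        records.append(textwrap.fill(''.join(str(s) for s in symbols)))
        records.append('\n')
    return ''.join(records)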



#-------------------------------------------------------------------------------
# HMMFactory and derived  -----------------------------------------------------
class HMMFactory(object):
    """ A HMMFactory is the base class of HMM factories.
        A HMMFactory has just a constructor and a call method
    """


GHMM_FILETYPE_SMO = 'smo' #obsolete
GHMM_FILETYPE_XML = 'xml'
GHMM_FILETYPE_HMMER = 'hmm'
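
# Illustrative sketch (not part of the original GHMM API): the file type
# constants above double as filename extensions, which is how
# HMMOpenFactory.guessFileType below maps a file name to a format. The helper
# name _guessFileTypeSketch and its knownTypes parameter are hypothetical; the
# function only restates that lookup in a self-contained form and is not used
# elsewhere in this module.
def _guessFileTypeSketch(filename, knownTypes=('xml', 'smo', 'hmm')):
    """ Returns the entry of knownTypes matching the extension of filename,
    or None if the extension is unknown. """
    for fileType in knownTypes:
        if filename.endswith('.' + fileType):
            return fileType
    return None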

class HMMOpenFactory(HMMFactory):
    """ Opens a HMM from a file.

    Currently four formats are supported:
    HMMer, our smo file format, and two xml formats.
    @note the support for smo files and the old xml format will phase out
    """
    def __init__(self, defaultFileType=None):
        self.defaultFileType = defaultFileType

    def guessFileType(self, filename):
        """ guesses the file format from the filename """
        if filename.endswith('.'+GHMM_FILETYPE_XML):
            return GHMM_FILETYPE_XML
        elif filename.endswith('.'+GHMM_FILETYPE_SMO):#obsolete
            return GHMM_FILETYPE_SMO
        elif filename.endswith('.'+GHMM_FILETYPE_HMMER):#obsolete
            return GHMM_FILETYPE_HMMER
        else:
            return None

    def __call__(self, fileName, modelIndex=None, filetype=None):

        if not isinstance(fileName,io.StringIO):
            if not os.path.exists(fileName):
                raise IOError('File ' + str(fileName) + ' not found.')

        if not filetype:
            if self.defaultFileType:
                log.warning("HMMOpenHMMER, HMMOpenSMO and HMMOpenXML are deprecated. "
                            + "Use HMMOpen and the filetype parameter if needed.")
                filetype = self.defaultFileType
            else:
                filetype = self.guessFileType(fileName)
            if not filetype:
                raise WrongFileType("Could not guess the type of file " + str(fileName)
                                    + " and no filetype specified")

        # XML file: both new and old format
        if filetype == GHMM_FILETYPE_XML:
            # try to validate against ghmm.dtd
            if ghmmwrapper.ghmm_xmlfile_validate(fileName):
                return self.openNewXML(fileName, modelIndex)
            else:
                return self.openOldXML(fileName)
        elif filetype == GHMM_FILETYPE_SMO:
            return self.openSMO(fileName, modelIndex)
        elif filetype == GHMM_FILETYPE_HMMER:
            return self.openHMMER(fileName)
        else:
            raise TypeError("Invalid file type " + str(filetype))


    def openNewXML(self, fileName, modelIndex):
        """ Opens one or more HMMs in the new xml format """
        # opens and parses the file
        file = ghmmwrapper.ghmm_xmlfile_parse(fileName)
        if file is None:
            log.debug( "XML has file format problems!")
            raise WrongFileType("file is not in GHMM xml format")

        nrModels = file.noModels
        modelType = file.modelType

        # we have a continuous HMM, prepare for hmm creation
        if (modelType & ghmmwrapper.kContinuousHMM):
            emission_domain = Float()
            if (modelType & ghmmwrapper.kMultivariate):
                distribution = MultivariateGaussianDistribution
                hmmClass = MultivariateGaussianMixtureHMM
            else:
                distribution = ContinuousMixtureDistribution
                hmmClass = ContinuousMixtureHMM
            getModel = file.get_cmodel

        # we have a discrete HMM, prepare for hmm creation
        elif ((modelType & ghmmwrapper.kDiscreteHMM)
              and not (modelType & ghmmwrapper.kTransitionClasses)
              and not (modelType & ghmmwrapper.kPairHMM)):
            emission_domain = 'd'
            distribution = DiscreteDistribution
            getModel = file.get_dmodel
            if (modelType & ghmmwrapper.kLabeledStates):
                hmmClass = StateLabelHMM
            else:
                hmmClass = DiscreteEmissionHMM

        # currently not supported
        else:
            raise UnsupportedFeature("Non-supported model type")


        # read all models to list at first
        result = []
        for i in range(nrModels):
            cmodel = getModel(i)
            if emission_domain == 'd':
                emission_domain = Alphabet(cmodel.alphabet)
            if modelType & ghmmwrapper.kLabeledStates:
                labelDomain = LabelDomain(cmodel.label_alphabet)
                m = hmmClass(emission_domain, distribution(emission_domain), labelDomain, cmodel)
            else:
                m = hmmClass(emission_domain, distribution(emission_domain), cmodel)

            result.append(m)

        # return a single model if an index was requested
        if modelIndex != None:
            if modelIndex < nrModels:
                result = result[modelIndex]
            else:
                raise IndexError("the file %s has only %s models" % (fileName, str(nrModels)))
        elif nrModels == 1:
            result = result[0]

        return result
    #obsolete
    def openOldXML(self, fileName):
        from ghmm_gato import xmlutil
        hmm_dom = xmlutil.HMM(fileName)
        emission_domain = hmm_dom.AlphabetType()

        if emission_domain == int:
            [alphabets, A, B, pi, state_orders] = hmm_dom.buildMatrices()

            emission_domain = Alphabet(alphabets)
            distribution = DiscreteDistribution(emission_domain)
            # build adjacency list

            # check for background distributions
            (background_dist, orders, code2name) = hmm_dom.getBackgroundDist()
            # (background_dist, orders) = hmm_dom.getBackgroundDist()
            bg_list = []
            # if background distribution exists, set background distribution here
            if background_dist != {}:
                # transformation to list for input into BackgroundDistribution,
                # ensure the right order
                for i in range(len(list(code2name.keys()))-1):
                    bg_list.append(background_dist[code2name[i]])

                bg = BackgroundDistribution(emission_domain, bg_list)

            # check for state labels
            (label_list, labels) = hmm_dom.getLabels()
            if labels == ['None']:
                labeldom   = None
                label_list = None
            else:
                labeldom = LabelDomain(labels)

            m = HMMFromMatrices(emission_domain, distribution, A, B, pi, None, labeldom, label_list)

            # old xml is discrete, set appropriate flag
            m.cmodel.addModelTypeFlags(ghmmwrapper.kDiscreteHMM)

            if background_dist != {}:
                ids = [-1]*m.N
                for s in list(hmm_dom.state.values()):
                    ids[s.index-1] = s.background # s.index ranges from [1, m.N]

                m.setBackground(bg, ids)
                log.debug( "model_type %x" % m.cmodel.model_type)
                log.debug("background_id" + str( ghmmwrapper.int_array2list(m.cmodel.background_id, m.N)))
            else:
                m.cmodel.bp = None
                m.cmodel.background_id = None

            # check for tied states
            tied = hmm_dom.getTiedStates()
            if len(tied) > 0:
                m.setFlags(kTiedEmissions)
                m.cmodel.tied_to = ghmmwrapper.list2int_array(tied)

            durations = hmm_dom.getStateDurations()
            if len(durations) == m.N:
                log.debug("durations: " + str(durations))
                m.extendDurations(durations)

            return m
    #obsolete
    def openSMO(self, fileName, modelIndex):
        # MO & SMO Files, format is deprecated
        # check if ghmm is built with smo support
        if not ghmmwrapper.SMO_FILE_SUPPORT:
            raise UnsupportedFeature("smo files are deprecated. Please convert your files "
                                      "to the new xml-format or rebuild the GHMM with the "
                                      "conditional \"GHMM_OBSOLETE\".")

        (hmmClass, emission_domain, distribution) = self.determineHMMClass(fileName)

        log.debug("determineHMMClass = "+ str(  (hmmClass, emission_domain, distribution)))

        # XXX broken since silent states are not supported by .smo file format
        if hmmClass == DiscreteEmissionHMM:
            models = ghmmwrapper.ghmm_dmodel_read(fileName)
            base_model_type = ghmmwrapper.kDiscreteHMM
        else:
            models = ghmmwrapper.ghmm_cmodel_read(fileName)
            base_model_type = ghmmwrapper.kContinuousHMM

        if modelIndex == None:
            result = []
            for cmodel in models:
                # ugly workaround for SWIG not creating a proxy class
                cmodel = ghmmwrapper.ghmm_cmodel(cmodel)
                cmodel.addModelTypeFlags(base_model_type)
                m = hmmClass(emission_domain, distribution(emission_domain), cmodel)
                result.append(m)
        else:
            if modelIndex < len(models):
                cmodel = models[modelIndex]
                cmodel.addModelTypeFlags(base_model_type)
                result = hmmClass(emission_domain, distribution(emission_domain), cmodel)
            else:
                raise IndexError(str(fileName) + " has only " + str(len(models)) + " models")

        return result

    def openSingleHMMER(self, fileName):
        # HMMER format models
        h = modhmmer.hmmer(fileName)

        if h.m == 4:  # DNA model
            emission_domain = DNA
        elif h.m == 20:   # Peptide model
            emission_domain = AminoAcids
        else:   # some other model
            emission_domain = IntegerRange(0,h.m)
        distribution = DiscreteDistribution(emission_domain)

        # XXX TODO: Probably slow for large matrices (Rewrite for 0.9)
        [A,B,pi,modelName] = h.getGHMMmatrices()
        return  HMMFromMatrices(emission_domain, distribution, A, B, pi, hmmName=modelName)


    def openHMMER(self, fileName):
        """
        Reads a file containing multiple HMMs in HMMER format, returns list of
        HMM objects or a single HMM object.
        """
        if not os.path.exists(fileName):
            raise IOError('File ' + str(fileName) + ' not found.')

        modelList = []
        string = ""
        res = re.compile("^//")
        stat = re.compile(r"^ACC\s+(\w+)")
        # the with statement closes the file once all models are read
        with open(fileName,"r") as f:
            for line in f.readlines():
                string = string + line
                m = stat.match(line)
                if m:
                    name = m.group(1)
                    log.info( "Reading model " + str(name) + ".")

                # a line starting with '//' terminates one HMMER model
                match = res.match(line)
                if match:
                    fileLike = io.StringIO(string)
                    modelList.append(self.openSingleHMMER(fileLike))
                    string = ""
                    match = None

        if len(modelList) == 1:
            return modelList[0]
        return modelList


    def determineHMMClass(self, fileName):
        #
        # smo files. Obsolete
        #
        file = open(fileName,'r')

        hmmRe = re.compile(r"^HMM\s*=")
        shmmRe = re.compile(r"^SHMM\s*=")
        mvalueRe = re.compile(r"M\s*=\s*([0-9]+)")
        densityvalueRe = re.compile(r"density\s*=\s*([0-9]+)")
        cosvalueRe = re.compile(r"cos\s*=\s*([0-9]+)")
        emission_domain = None

        while 1:
            l = file.readline()
            if not l:
                break
            l = l.strip()
            if len(l) > 0 and l[0] != '#': # Not a comment line
                hmm = hmmRe.search(l)
                shmm = shmmRe.search(l)
                mvalue = mvalueRe.search(l)
                densityvalue = densityvalueRe.search(l)
                cosvalue = cosvalueRe.search(l)

                if hmm != None:
                    if emission_domain != None and emission_domain != 'int':
                        log.error( "HMMOpenFactory:determineHMMClass: both HMM and SHMM? " + str(emission_domain))
                    else:
                        emission_domain = 'int'

                if shmm != None:
                    if emission_domain != None and emission_domain != 'double':
                        log.error( "HMMOpenFactory:determineHMMClass: both HMM and SHMM? " + str(emission_domain))
                    else:
                        emission_domain = 'double'

                if mvalue != None:
                    M = int(mvalue.group(1))

                if densityvalue != None:
                    density = int(densityvalue.group(1))

                if cosvalue != None:
                    cos = int(cosvalue.group(1))

        file.close()
        if emission_domain == 'int':
            # only integer alphabet
            emission_domain = IntegerRange(0,M)
            distribution = DiscreteDistribution
            hmm_class = DiscreteEmissionHMM
            return (hmm_class, emission_domain, distribution)

        elif emission_domain == 'double':
            # M        number of mixture components
            # density  component type
            # cos      number of state transition classes
            if M == 1 and density == 0:
                emission_domain = Float()
                distribution = GaussianDistribution
                hmm_class = GaussianEmissionHMM
                return (hmm_class, emission_domain, distribution)

            elif  M > 1 and density == 0:
                emission_domain = Float()
                distribution = GaussianMixtureDistribution
                hmm_class = GaussianMixtureHMM
                return (hmm_class, emission_domain, distribution)

            else:
                raise TypeError("Model type can not be determined.")

        return (None, None, None)

# the following three methods are deprecated
HMMOpenHMMER = HMMOpenFactory(GHMM_FILETYPE_HMMER) # read single HMMER model from file
HMMOpenSMO   = HMMOpenFactory(GHMM_FILETYPE_SMO)
HMMOpenXML   = HMMOpenFactory(GHMM_FILETYPE_XML)

# use only HMMOpen and specify the filetype if it can't be guessed from the extension
HMMOpen      = HMMOpenFactory()
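
# Illustrative sketch (not part of the original GHMM API): for a discrete
# state over an alphabet of size M, HMMFromMatricesFactory below expects
# len(B[i]) == M**(order+1) emission parameters, where order is the number of
# preceding symbols the emission is conditioned on. The hypothetical helper
# _stateOrderSketch recovers the order with integer arithmetic only, which
# sidesteps rounding in floating point logarithms (math.log(243, 3) is
# slightly below 5 on common platforms).
def _stateOrderSketch(nrParameters, M):
    """ Returns the emission order for nrParameters emission parameters over
    an alphabet of size M, or None if nrParameters is not of the form
    M**(order+1). """
    if M <= 1:
        # mirrors the factory's special case for alphabets of size one
        return nrParameters - 1
    order = 0
    power = M
    # multiply up until M**(order+1) reaches or exceeds nrParameters
    while power < nrParameters:
        power *= M
        order += 1
    if power == nrParameters:
        return order
    return None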


class HMMFromMatricesFactory(HMMFactory):
    """ @todo Document matrix formats """

    # XXX TODO: this should use the editing context
    def __call__(self, emissionDomain, distribution, A, B, pi, hmmName = None, labelDomain= None, labelList = None, densities = None):
        if isinstance(emissionDomain, Alphabet):

            if not emissionDomain == distribution.alphabet:
                raise TypeError("emissionDomain and distribution must be compatible")

            # checking matrix dimensions and argument validation, only some obvious errors are checked
            if not len(A) == len(A[0]):
                raise InvalidModelParameters("A is not quadratic.")
            if not len(pi) == len(A):
                raise InvalidModelParameters("Length of pi does not match length of A.")
            if not len(A) == len(B):
                raise InvalidModelParameters("Different number of entries in A and B.")

            if (labelDomain is None and labelList is not None) or (labelList is None and labelDomain is not None):
                raise InvalidModelParameters("Specify either both labelDomain and labelList or neither.")

            if isinstance(distribution,DiscreteDistribution):
                # HMM has discrete emissions over finite alphabet: DiscreteEmissionHMM
                cmodel = ghmmwrapper.ghmm_dmodel(len(A), len(emissionDomain))

                # assign model identifier (if specified)
                if hmmName != None:
                    cmodel.name = hmmName
                else:
                    cmodel.name = ''

                states = ghmmwrapper.dstate_array_alloc(cmodel.N)
                silent_states = []
                tmpOrder = []

                #initialize states
                for i in range(cmodel.N):
                    state = ghmmwrapper.dstate_array_getRef(states, i)
                    # compute state order
                    if cmodel.M > 1:
                        order = math.log(len(B[i]), cmodel.M)-1
                    else:
                        order = len(B[i]) - 1

                    log.debug( "order in state " + str(i) + " = " + str(order) )
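                    # check for valid number of emission parameters;
                    # round instead of truncating, since math.log can return
                    # values slightly below the exact integer
                    order = int(round(order))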
                    if  cmodel.M**(order+1) == len(B[i]):
                        tmpOrder.append(order)
                    else:
                        raise InvalidModelParameters("The number of " + str(len(B[i])) +
                                                     " emission parameters for state " +
                                                     str(i) + " is invalid. State order can not be determined.")

                    state.b = ghmmwrapper.list2double_array(B[i])
                    state.pi = pi[i]

                    if sum(B[i]) == 0.0:
                        silent_states.append(1)
                    else:
                        silent_states.append(0)

                    #set out probabilities
                    state.out_states, state.out_id, state.out_a = ghmmhelper.extract_out(A[i])

                    #set "in" probabilities
                    A_col_i = [x[i] for x in A]
                    # Numarray use A[,:i]
                    state.in_states, state.in_id, state.in_a = ghmmhelper.extract_out(A_col_i)
                    #fix probabilities in reestimation, else 0
                    state.fix = 0

                cmodel.s = states
                if sum(silent_states) > 0:
                    cmodel.model_type |= kSilentStates
                    cmodel.silent = ghmmwrapper.list2int_array(silent_states)

                cmodel.maxorder = max(tmpOrder)
                if cmodel.maxorder > 0:
                    log.debug( "Set kHigherOrderEmissions.")
                    cmodel.model_type |= kHigherOrderEmissions
                    cmodel.order = ghmmwrapper.list2int_array(tmpOrder)

                # initialize lookup table for powers of the alphabet size,
                # speeds up models with higher order states
                powLookUp = [1] * (cmodel.maxorder+2)
                for i in range(1,len(powLookUp)):
                    powLookUp[i] = powLookUp[i-1] * cmodel.M
                cmodel.pow_lookup = ghmmwrapper.list2int_array(powLookUp)

                # check for state labels
                if labelDomain is not None and labelList is not None:
                    if not isinstance(labelDomain,LabelDomain):
                        raise TypeError("LabelDomain object required.")

                    cmodel.model_type |= kLabeledStates
                    m = StateLabelHMM(emissionDomain, distribution, labelDomain, cmodel)
                    m.setLabels(labelList)
                    return m
                else:
                    return DiscreteEmissionHMM(emissionDomain, distribution, cmodel)
            else:
                raise GHMMError(type(distribution), "Not a valid distribution for Alphabet")

        elif isinstance(emissionDomain, Float):
            # determining number of transition classes
            cos = ghmmhelper.classNumber(A)
            if cos == 1:
                A = [A]

            cmodel = ghmmwrapper.ghmm_cmodel(len(A[0]), cos)
            log.debug("cmodel.cos = " + str(cmodel.cos))

            self.constructSwitchingTransitions(cmodel, pi, A)

            if isinstance(distribution, GaussianDistribution):
                #initialize emissions
                for i in range(cmodel.N):
                    state = ghmmwrapper.cstate_array_getRef(cmodel.s, i)
                    state.M = 1

                    # set up emission(s), density type is normal
                    emissions = ghmmwrapper.c_emission_array_alloc(1)
                    emission = ghmmwrapper.c_emission_array_getRef(emissions, 0)
                    emission.type = ghmmwrapper.normal
                    emission.dimension = 1
                    (mu, sigma) = B[i]
                    emission.mean.val = mu #mu = mue in GHMM C-lib.
                    emission.variance.val = sigma
                    emission.fixed = 0  # fixing of emission deactivated by default
                    emission.setDensity(0)

                    # append emission to state
                    state.e = emissions
                    state.c = ghmmwrapper.list2double_array([1.0])

                return GaussianEmissionHMM(emissionDomain, distribution, cmodel)

            elif isinstance(distribution, GaussianMixtureDistribution):
                # Interpretation of B matrix for the mixture case
                # (Example with three states and two components each):
                #  B = [
                #      [ ["mu11","mu12"],["sig11","sig12"],["w11","w12"]   ],
                #      [  ["mu21","mu22"],["sig21","sig22"],["w21","w22"]  ],
                #      [  ["mu31","mu32"],["sig31","sig32"],["w31","w32"]  ],
                #      ]

                log.debug( "*** mixture model")

                cmodel.M = len(B[0][0])

                #initialize states
                for i in range(cmodel.N):
                    state = ghmmwrapper.cstate_array_getRef(cmodel.s, i)
                    state.M = len(B[0][0])

                    # allocate arrays of emission parameters
                    mu_list = B[i][0]
                    sigma_list = B[i][1]
                    weight_list = B[i][2]

                    state.c = ghmmwrapper.list2double_array(weight_list)

                    # set up emission(s), density type is normal
                    emissions = ghmmwrapper.c_emission_array_alloc(state.M)

                    for j in range(state.M):
                        emission = ghmmwrapper.c_emission_array_getRef(emissions, j)
                        emission.type = ghmmwrapper.normal
                        emission.dimension = 1
                        mu = mu_list[j]
                        sigma = sigma_list[j]
                        emission.mean.val = mu #mu = mue in GHMM C-lib.
                        emission.variance.val = sigma
                        emission.fixed = 0  # fixing of emission deactivated by default
                        emission.setDensity(0)

                    # append emissions to state
                    state.e = emissions

                return GaussianMixtureHMM(emissionDomain, distribution, cmodel)

            elif isinstance(distribution, ContinuousMixtureDistribution):
                # Interpretation of B matrix for the mixture case
                # (Example with three states and two components each):
                #  B = [
                #      [["mu11","mu12"], ["sig11","sig12"], ["a11","a12"], ["w11","w12"]],
                #      [["mu21","mu22"], ["sig21","sig22"], ["a21","a22"], ["w21","w22"]],
                #      [["mu31","mu32"], ["sig31","sig32"], ["a31","a32"], ["w31","w32"]],
                #      ]
                #
                # ghmmwrapper.uniform: mu = max, sig = min
                # ghmmwrapper.normal_right or ghmmwrapper.normal_left: a = cutoff

                log.debug( "*** general mixture model")

                cmodel.M = len(B[0][0])

                #initialize states
                for i in range(cmodel.N):
                    state = ghmmwrapper.cstate_array_getRef(cmodel.s, i)
                    state.M = len(B[i][0])

                    # set up emission(s), density types are taken from densities[i]
                    emissions = ghmmwrapper.c_emission_array_alloc(state.M)
                    weight_list = B[i][3]

                    combined_map = [(first, B[i][0][n], B[i][1][n], B[i][2][n])
                                    for n, first  in enumerate(densities[i])]

                    for j, parameters in enumerate(combined_map):
                        emission = ghmmwrapper.c_emission_array_getRef(emissions, j)
                        emission.type = densities[i][j]
                        emission.dimension = 1
                        if (emission.type == ghmmwrapper.normal
                            or emission.type == ghmmwrapper.normal_approx):
                            emission.mean.val = parameters[1]
                            emission.variance.val = parameters[2]
                        elif emission.type == ghmmwrapper.normal_right:
                            emission.mean.val = parameters[1]
                            emission.variance.val = parameters[2]
                            emission.min = parameters[3]
                        elif emission.type == ghmmwrapper.normal_left:
                            emission.mean.val = parameters[1]
                            emission.variance.val = parameters[2]
                            emission.max = parameters[3]
                        elif emission.type == ghmmwrapper.uniform:
                            emission.max = parameters[1]
                            emission.min = parameters[2]
                        else:
                            raise TypeError("Unknown distribution type: " + str(emission.type))

                    # append emissions to state
                    state.e = emissions
                    state.c = ghmmwrapper.list2double_array(weight_list)

                return ContinuousMixtureHMM(emissionDomain, distribution, cmodel)

            elif isinstance(distribution, MultivariateGaussianDistribution):
                log.debug( "*** multivariate gaussian distribution model")

                # this is being extended to also support mixtures of multivariate gaussians
                # Interpretation of B matrix for the multivariate gaussian case
                # (Example with three states and two mixture components with two dimensions):
                #  B = [
                #       [["mu111","mu112"],["sig1111","sig1112","sig1121","sig1122"],
                #        ["mu121","mu122"],["sig1211","sig1212","sig1221","sig1222"],
                #        ["w11","w12"] ],
                #       [["mu211","mu212"],["sig2111","sig2112","sig2121","sig2122"],
                #        ["mu221","mu222"],["sig2211","sig2212","sig2221","sig2222"],
                #        ["w21","w22"] ],
                #       [["mu311","mu312"],["sig3111","sig3112","sig3121","sig3122"],
                #        ["mu321","mu322"],["sig3211","sig3212","sig3221","sig3222"],
                #        ["w31","w32"] ],
                #      ]
                #
                # ["mu311","mu312"] is the mean vector of the two dimensional
                # gaussian in state 3, mixture component 1
                # ["sig1211","sig1212","sig1221","sig1222"] is the covariance
                # matrix of the two dimensional gaussian in state 1, mixture component 2
                # ["w21","w22"] are the weights of the mixture components
                # in state 2
                # For states with only one mixture component, an implicit weight
                # of 1.0 is assumed

                cmodel.addModelTypeFlags(ghmmwrapper.kMultivariate)
                cmodel.dim = len(B[0][0]) # all states must have same dimension

                #initialize states
                for i in range(cmodel.N):
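                    # set up state parameters
                    state = ghmmwrapper.cstate_array_getRef(cmodel.s, i)
                    # integer division: B[i] holds a (mu, sigma) pair per component,
                    # plus an optional trailing weight list
                    state.M = len(B[i]) // 2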
                    if state.M > cmodel.M:
                        cmodel.M = state.M

                    # multiple mixture components
                    if state.M > 1:
                        state.c = ghmmwrapper.list2double_array(B[i][state.M*2]) # Mixture weights.
                    else:
                        state.c = ghmmwrapper.list2double_array([1.0])

                    # set up emission(s), density type is normal
                    emissions = ghmmwrapper.c_emission_array_alloc(state.M) # M emission components in this state

                    for em in range(state.M):
                        emission = ghmmwrapper.c_emission_array_getRef(emissions,em)
                        emission.dimension = len(B[0][0]) # dimension must be same in all states and emissions
                        mu = B[i][em*2]
                        sigma = B[i][em*2+1]
                        emission.mean.vec = ghmmwrapper.list2double_array(mu)
                        emission.variance.mat = ghmmwrapper.list2double_array(sigma)
                        emission.sigmacd = ghmmwrapper.list2double_array(sigma) # just for allocating the space
                        emission.sigmainv = ghmmwrapper.list2double_array(sigma) # just for allocating the space
                        emission.fixed = 0  # fixing of emission deactivated by default
                        emission.setDensity(6)
                        # calculate inverse and determinant of covariance matrix
                        determinant = ghmmwrapper.list2double_array([0.0])
                        ghmmwrapper.ighmm_invert_det(emission.sigmainv, determinant,
                                                     emission.dimension, emission.variance.mat)
                        emission.det = ghmmwrapper.double_array_getitem(determinant, 0)

                    # append emissions to state
                    state.e = emissions

                return MultivariateGaussianMixtureHMM(emissionDomain, distribution, cmodel)

            else:
                raise GHMMError(type(distribution),
                                "Cannot construct model for this domain/distribution combination")
        else:
            raise TypeError("Unknown emission domain: " + str(emissionDomain))

    def constructSwitchingTransitions(self, cmodel, pi, A):
        """ @internal function: creates switching transitions """

        #initialize states
        for i in range(cmodel.N):

            state = ghmmwrapper.cstate_array_getRef(cmodel.s, i)
            state.pi = pi[i]

            #set out probabilities
            trans = ghmmhelper.extract_out_cos(A, cmodel.cos, i)
            state.out_states = trans[0]
            state.out_id = trans[1]
            state.out_a = trans[2]

            #set "in" probabilities
            trans = ghmmhelper.extract_in_cos(A,cmodel.cos, i)
            state.in_states = trans[0]
            state.in_id = trans[1]
            state.in_a = trans[2]


HMMFromMatrices = HMMFromMatricesFactory()
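
# Usage sketch, not part of the original source: building a two-state discrete
# model through the factory above. It assumes the Alphabet and
# DiscreteDistribution classes defined elsewhere in this module; all numbers
# are made up for illustration.
#
#   sigma = Alphabet(['a', 'b'])
#   A  = [[0.9, 0.1], [0.2, 0.8]]    # transition matrix, rows sum to 1
#   B  = [[0.5, 0.5], [0.1, 0.9]]    # one emission distribution per state
#   pi = [0.5, 0.5]                  # initial state distribution
#   m  = HMMFromMatrices(sigma, DiscreteDistribution(sigma), A, B, pi)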

#-------------------------------------------------------------------------------
#- Background distribution

class BackgroundDistribution(object):
    """ Background distributions object

        holds discrete distributions used as background while training
        discrete HMMs to avoid overfitting.
        Input is a discrete EmissionDomain and a list of lists. Each list is
        a distinct distribution. The distributions can be of higher order.
        The length of a single distribution is a power of len(EmissionDomain).
    """
    def __init__(self, emissionDomain, bgInput):

        if type(bgInput) == list:
            self.emissionDomain = emissionDomain
            distNum = len(bgInput)

            order = ghmmwrapper.int_array_alloc(distNum)
            b = ghmmwrapper.double_matrix_alloc_row(distNum)
            for i in range(distNum):
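                if len(emissionDomain) > 1:
                    # round to absorb floating point error in math.log,
                    # then verify the length really is a power of the alphabet size
                    o = round(math.log(len(bgInput[i]), len(emissionDomain))) - 1
                    valid = len(emissionDomain) ** (o + 1) == len(bgInput[i])
                else:
                    o = len(bgInput[i]) - 1
                    valid = True

                assert valid, "Invalid order of distribution " + str(i) + ": " + str(o)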

                ghmmwrapper.int_array_setitem(order, i, int(o))
                # dynamic allocation, rows have different lengths
                b_i = ghmmwrapper.list2double_array(bgInput[i])
                ghmmwrapper.double_matrix_set_col(b, i, b_i)

            self.cbackground = ghmmwrapper.ghmm_dbackground(distNum, len(emissionDomain), order, b)
            self.name2id = dict()
        elif isinstance(bgInput, ghmmwrapper.ghmm_dbackground):
            self.cbackground = bgInput
            self.emissionDomain = emissionDomain
            self.name2id = dict()
            self.updateName2id()
        else:
            raise TypeError("Input type "+str(type(bgInput)) +" not recognized.")

    def __del__(self):
        log.debug( "__del__ BackgroundDistribution " + str(self.cbackground))
        del self.cbackground
        self.cbackground = None

    def __str__(self):
        outstr = 'BackgroundDistribution (N= '+str(self.cbackground.n)+'):\n'
        outstr += str(self.emissionDomain) + "\n"
        d = ghmmhelper.double_matrix2list(self.cbackground.b, self.cbackground.n, len(self.emissionDomain))
        outstr += "Distributions:\n"
        f = lambda x: "%.2f" % (x,)  # float rounding function

        for i in range(self.cbackground.n):
            if self.cbackground.getName(i) is not None:
                outstr += '  ' + str(i+1) + ", name = " + self.cbackground.getName(i)
            else:
                outstr += '  '+str(i+1)
            outstr += " :(order= " + str(self.cbackground.getOrder(i))+ "): "
            outstr += " " + ', '.join(map(f, d[i])) + "\n"
        return outstr


    def verboseStr(self):
        outstr = "BackgroundDistribution instance:\n"
        outstr += "Number of distributions: " + str(self.cbackground.n)+"\n\n"
        outstr += str(self.emissionDomain) + "\n"
        d = ghmmhelper.double_matrix2list(self.cbackground.b, self.cbackground.n, len(self.emissionDomain))
        outstr += "Distributions:\n"
        for i in range(self.cbackground.n):
            outstr += "  Order: " + str(self.cbackground.getOrder(i))+"\n"
            outstr += "  " + str(i+1) +": "+str(d[i])+"\n"
        return outstr

    def getCopy(self):
        return BackgroundDistribution(self.emissionDomain, self.cbackground.copy())

    def toLists(self):
        dim = self.cbackground.m
        distNum = self.cbackground.n
        orders = ghmmwrapper.int_array2list(self.cbackground.order, distNum)
        B = []
        for i in range(distNum):
            order = orders[i]
            size = int(pow(dim, (order+1)))
            b = [0.0]*size
            for j in range(size):
                b[j] = ghmmwrapper.double_matrix_getitem(self.cbackground.b,i,j)
            B.append(b)
        return (distNum,orders,B)

    def getName(self, i):
        """return the name of the ith background distribution"""
        if i < self.cbackground.n:
            return self.cbackground.getName(i)

    def setName(self, i, name):
        """sets the name of the ith background distribution to name"""
        if i < self.cbackground.n:
            self.cbackground.setName(i, name)
            self.name2id[name] = i

    def updateName2id(self):
        """adds all background names to the dictionary name2id"""
        for i in range(self.cbackground.n):
            tmp = self.cbackground.getName(i)
            if tmp is not None:
                self.name2id[tmp] = i



#-------------------------------------------------------------------------------
#- HMM and derived
class HMM(object):
    """ The HMM base class.

    All functions where the C signatures allow it will be defined in here.
    Unfortunately there still is a lot of overloading going on in derived classes.

    Generic features (these apply to all derived classes):
    - Forward algorithm
    - Viterbi algorithm
    - Baum-Welch training
    - HMM distance metric
    - ...

    """
    def __init__(self, emissionDomain, distribution, cmodel):
        self.emissionDomain = emissionDomain
        self.distribution = distribution
        self.cmodel = cmodel

        self.N = self.cmodel.N  # number of states
        self.M = self.cmodel.M  # number of symbols / mixture components
        self.name2id = dict()
        self.updateName2id()


    def __del__(self):
        """ Deallocation routine for the underlying C data structures. """
        log.debug( "__del__ HMM" + str(self.cmodel))

    def loglikelihood(self, emissionSequences):
        """ Compute log( P[emissionSequences| model]) using the forward algorithm
        assuming independence of the sequences in emissionSequences

        @param emissionSequences can either be a SequenceSet or an EmissionSequence

        @returns log( P[emissionSequences| model]) of type float which is
        computed as $sum_{s} log( P[s| model])$ when emissionSequences
        is a SequenceSet

        @note The implementation does not compute the full forward matrix since
        we are only interested in the likelihoods in this case.
        """
        return sum(self.loglikelihoods(emissionSequences))


    def loglikelihoods(self, emissionSequences):
        """ Compute a vector ( log( P[s| model]) )_{s} of log-likelihoods of the
        individual emission sequences using the forward algorithm

        @param emissionSequences is of type SequenceSet

        @returns a list of floats holding log( P[s| model]) for each sequence s;
        the entry is -inf for a sequence that cannot be generated by the model

        """
        log.debug("HMM.loglikelihoods() -- begin")
        emissionSequences = emissionSequences.asSequenceSet()
        seqNumber = len(emissionSequences)

        likelihoodList = []

        for i in range(seqNumber):
            log.debug("getting likelihood for sequence %i" % i)
            seq = emissionSequences.cseq.getSequence(i)
            tmp = emissionSequences.cseq.getLength(i)

            ret_val,likelihood = self.cmodel.logp(seq, tmp)
            if ret_val == -1:

                log.warning("forward returned -1: Sequence "+str(i)+" cannot be built.")
                # XXX TODO Eventually this should trickle down to C-level
                # Returning -DBL_MIN instead of infinity is stupid, since the latter allows
                # to continue further computations with that inf, which causes
                # things to blow up later.
                # cmodel.logp() could do without a return value if -Inf is returned
                # What should be the semantics in case of computing the likelihood of
                # a set of sequences
                likelihoodList.append(-float('Inf'))
            else:
                likelihoodList.append(likelihood)

        del emissionSequences
        log.debug("HMM.loglikelihoods() -- end")
        return likelihoodList

    # Further Marginals ...
    def pathPosterior(self, sequence, path):
        """
        @returns a list with the posterior probability of each state in 'path'
        at the corresponding position of 'sequence'.

        @attention pathPosterior needs to calculate the complete forward and
        backward matrices. If you are interested in multiple paths it would
        be more efficient to use the 'posterior' function directly and not
        multiple calls to pathPosterior

        @todo for silent states things are more complicated -> to be done
        """
        # XXX TODO for silent states things are more complicated -> to be done
        if self.hasFlags(kSilentStates):
            raise NotImplementedError("Models with silent states not yet supported.")

        # calculate complete posterior matrix
        post = self.posterior(sequence)
        path_posterior = []

        if not self.hasFlags(kSilentStates):
            # if there are no silent states things are straightforward
            assert len(path) == len(sequence), "Path and sequence have different lengths"

            # appending posteriors for each element of path
            for p,state in enumerate(path):
                try:
                    path_posterior.append(post[p][state])
                except IndexError:
                    raise IndexError("Invalid state index " + str(state) + ". Model and path are incompatible")
            return path_posterior
#        # XXX TODO silent states are yet to be done
#        else:
#            # for silent state models we have to propagate the silent states in each column of the
#            # posterior matrix
#
#            assert not self.isSilent(path[0]), "First state in path must not be silent."
#
#            j = 0   # path index
#            for i in range(len(sequence)):
#                pp = post[i][path[j]]
#
#                print pp
#
#                if pp == 0:
#                    return float('-inf')
#                else:
#                    path_log_lik += math.log(post[p][path[p]])
#                    j+=1
#
#
#                # propagate path up until the next emitting state
#                while self.isSilent(path[j]):
#
#                    print "** silent state ",path[j]
#
#                    pp =  post[i][path[j]]
#                    if pp == 0:
#                        return float('-inf')
#                    else:
#                        path_log_lik += math.log(post[p][path[p]])
#                        j+=1
#
#            return path_log_lik

    def statePosterior(self, sequence, state, time):
        """
        @returns the posterior probability for being at 'state'
        at time 'time' in 'sequence'.

        @attention: statePosterior needs to calculate the complete forward
        and backward matrices. If you are interested in multiple states
        it would be more efficient to use the posterior function directly
        and not multiple calls to statePosterior

        @todo for silent states things are more complicated -> to be done
        """
        # XXX TODO for silent states things are more complicated -> to be done
        if self.hasFlags(kSilentStates):
            raise NotImplementedError("Models with silent states not yet supported.")

        # checking function arguments
        if not 0 <= time < len(sequence):
            raise IndexError("Invalid sequence index: "+str(time)+" (sequence has length "+str(len(sequence))+" ).")
        if not 0 <= state < self.N:
            raise IndexError("Invalid state index: " +str(state)+ " (model has "+str(self.N)+" states ).")

        post = self.posterior(sequence)
        return post[time][state]


    def posterior(self, sequence):
        """ Posterior distribution matrix for 'sequence'.

        @todo for silent states things are more complicated -> to be done
        """
        # XXX TODO for silent states things are more complicated -> to be done
        if self.hasFlags(kSilentStates):
            raise NotImplementedError("Models with silent states not yet supported.")

        if not isinstance(sequence, EmissionSequence):
            raise TypeError("Input to posterior must be EmissionSequence object")

        (alpha,scale)  = self.forward(sequence)
        beta = self.backward(sequence,scale)

        # element-wise product of the forward and backward matrices
        return [[a * b for a, b in zip(v, w)] for v, w in zip(alpha, beta)]


    def joined(self, emissionSequence, stateSequence):
        """ log P[ emissionSequence, stateSequence| m] """

        if not isinstance(emissionSequence,EmissionSequence):
            raise TypeError("EmissionSequence required, got " + str(emissionSequence.__class__.__name__))

        seqdim = 1
        if emissionSequence.emissionDomain == Float():
            seqdim = emissionSequence.cseq.dim
            if seqdim < 1:
                seqdim = 1

        t = len(emissionSequence)
        s = len(stateSequence)

        if t/seqdim != s and not self.hasFlags(kSilentStates):
            raise IndexError("sequence and state sequence have different lengths " +
                             "but the model has no silent states.")

        seq = emissionSequence.cseq.getSequence(0)
        states = ghmmwrapper.list2int_array(stateSequence)

        err, logp = self.cmodel.logp_joint(seq, t, states, s)

        # deallocation (done before the error check so that 'states' is not leaked)
        ghmmwrapper.free(states)

        if err != 0:
            log.error("logp_joint finished with -1: EmissionSequence cannot be built under stateSequence.")
            return None

        return logp

    # The functions for model training are defined in the derived classes.
    def baumWelch(self, trainingSequences, nrSteps=ghmmwrapper.MAX_ITER_BW, loglikelihoodCutoff=ghmmwrapper.EPS_ITER_BW):
        raise NotImplementedError("to be defined in derived classes")

    def baumWelchSetup(self, trainingSequences, nrSteps):
        raise NotImplementedError("to be defined in derived classes")

    def baumWelchStep(self, nrSteps, loglikelihoodCutoff):
        raise NotImplementedError("to be defined in derived classes")

    def baumWelchDelete(self):
        raise NotImplementedError("to be defined in derived classes")

    # extern double ghmm_c_prob_distance(smodel *cm0, smodel *cm, int maxT, int symmetric, int verbose);
    def distance(self, model, seqLength):
        """
        @returns the distance between 'self.cmodel' and 'model'.
        """
        return self.cmodel.prob_distance(model.cmodel, seqLength, 0, 0)


    def forward(self, emissionSequence):
        """
        @returns the (T x N)-matrix containing the forward-variables
        and the scaling vector
        """
        log.debug("HMM.forward -- begin")
        # XXX Allocations should be in try, except, finally blocks
        # to ensure deallocation even in the case of errors.
        # This will leak otherwise.
        seq = emissionSequence.cseq.getSequence(0)

        t = len(emissionSequence)
        calpha = ghmmwrapper.double_matrix_alloc(t, self.N)
        cscale = ghmmwrapper.double_array_alloc(t)

        error, unused = self.cmodel.forward(seq, t, calpha, cscale)
        if error == -1:
            log.error( "forward finished with -1: EmissionSequence cannot be built.")

        # translate alpha / scale to python lists
        pyscale = ghmmwrapper.double_array2list(cscale, t)
        pyalpha = ghmmhelper.double_matrix2list(calpha, t, self.N)

        # deallocation
        ghmmwrapper.free(cscale)
        ghmmwrapper.double_matrix_free(calpha, t)

        log.debug("HMM.forward -- end")
        return pyalpha, pyscale


    def backward(self, emissionSequence, scalingVector):
        """
        @returns the (T x N)-matrix containing the backward-variables
        """
        log.debug("HMM.backward -- begin")
        seq = emissionSequence.cseq.getSequence(0)

        # parsing 'scalingVector' to C double array.
        cscale = ghmmwrapper.list2double_array(scalingVector)

        # allocating beta matrix
        t = len(emissionSequence)
        cbeta = ghmmwrapper.double_matrix_alloc(t, self.N)

        error = self.cmodel.backward(seq,t,cbeta,cscale)
        if error == -1:
            log.error( "backward finished with -1: EmissionSequence cannot be built.")

        pybeta = ghmmhelper.double_matrix2list(cbeta,t,self.N)

        # deallocation
        ghmmwrapper.free(cscale)
        ghmmwrapper.double_matrix_free(cbeta,t)

        log.debug("HMM.backward -- end")
        return pybeta


    def viterbi(self, eseqs):
        """ Compute the Viterbi-path for each sequence in eseqs

        @param eseqs can either be a SequenceSet or an EmissionSequence

        @returns the pair (path, log_p) with the Viterbi-path [q_0, ..., q_T]
        and its log probability if eseqs holds a single sequence, or
        the pair of lists ([[q_0^0, ..., q_T^0], ..., [q_0^k, ..., q_T^k]],
        [log_p^0, ..., log_p^k]) for a k-sequence SequenceSet
        """
        log.debug("HMM.viterbi() -- begin")
        emissionSequences = eseqs.asSequenceSet()

        seqNumber = len(emissionSequences)

        allLogs = []
        allPaths = []
        for i in range(seqNumber):
            seq = emissionSequences.cseq.getSequence(i)
            seq_len = emissionSequences.cseq.getLength(i)

            if seq_len > 0:
                viterbiPath, pathlen, log_p = self.cmodel.viterbi(seq, seq_len)
                onePath = ghmmwrapper.int_array2list(viterbiPath, pathlen)
                ghmmwrapper.free(viterbiPath)
            else:
                # assumption: an empty sequence yields an empty path with log(1) = 0.0
                onePath = []
                log_p = 0.0

            allPaths.append(onePath)
            allLogs.append(log_p)

        log.debug("HMM.viterbi() -- end")
        if seqNumber > 1:
            return allPaths, allLogs
        else:
            return allPaths[0], allLogs[0]


    def sample(self, seqNr ,T, seed=0):
        """ Sample emission sequences.

        @param seqNr number of sequences to be sampled
        @param T maximal length of each sequence
        @param seed initialization value for rng, default 0 leaves the state
        of the rng alone
        @returns a SequenceSet object.
        """
        seqPtr = self.cmodel.generate_sequences(seed, T, seqNr, -1)
        return SequenceSet(self.emissionDomain, seqPtr)


    def sampleSingle(self, T, seed=0):
        """ Sample a single emission sequence of length at most T.

        @param T maximal length of the sequence
        @param seed initialization value for rng, default 0 leaves the state
        of the rng alone
        @returns an EmissionSequence object.
        """
        log.debug("HMM.sampleSingle() -- begin")
        seqPtr = self.cmodel.generate_sequences(seed, T, 1, -1)
        log.debug("HMM.sampleSingle() -- end")
        return EmissionSequence(self.emissionDomain, seqPtr)

    def getStateFix(self,state):
        state = self.state(state)
        s = self.cmodel.getState(state)
        return s.fix

    def setStateFix(self, state ,flag):
        state = self.state(state)
        s = self.cmodel.getState(state)
        s.fix = flag

    def clearFlags(self, flags):
        """ Clears one or more model type flags.
        @attention Use with care.
        """
        log.debug("clearFlags: " + self.printtypes(flags))
        self.cmodel.model_type &= ~flags

    def hasFlags(self, flags):
        """ Checks if the model has one or more model type flags set
        """
        return self.cmodel.model_type & flags

    def setFlags(self, flags):
        """ Sets one or more model type flags.
        @attention Use with care.
        """
        log.debug("setFlags: " + self.printtypes(flags))
        self.cmodel.model_type |= flags

    def state(self, stateLabel):
        """ Given a stateLabel return the integer index to the state

        """
        return self.name2id[stateLabel]

    def getInitial(self, i):
        """ Accessor function for the initial probability $pi_i$ """
        state = self.cmodel.getState(i)
        return state.pi

    def setInitial(self, i, prob, fixProb=False):
        """ Accessor function for the initial probability $pi_i$.

        If 'fixProb' = True $pi$ will be rescaled to sum to 1 with 'pi[i]'
        fixed to the argument value of 'prob'.

        """
        state = self.cmodel.getState(i)
        old_pi = state.pi
        state.pi = prob

        # renormalizing pi, pi(i) is fixed on value 'prob': the remaining
        # entries summed to (1 - old_pi) before and have to sum to (1 - prob).
        # If old_pi was 1 the remaining entries are all zero and cannot be
        # rescaled.
        if fixProb and old_pi < 1.0:
            coeff = (1.0 - prob) / (1.0 - old_pi)
            for j in range(self.N):
                if i != j:
                    state = self.cmodel.getState(j)
                    p = state.pi
                    state.pi = p * coeff

    def getTransition(self, i, j):
        """ Accessor function for the transition a_ij """
        i = self.state(i)
        j = self.state(j)

        transition = self.cmodel.get_transition(i, j)
        if transition < 0.0:
            transition = 0.0
        return transition

    def setTransition(self, i, j, prob):
        """ Accessor function for the transition a_ij. """
        i = self.state(i)
        j = self.state(j)

        if not 0.0 <= prob <= 1.0:
            raise ValueError("Transition " + str(prob) + " is not a probability.")

        self.cmodel.set_transition(i, j, prob)


    def getEmission(self, i):
        """
        Accessor function for the emission distribution parameters of state 'i'.

        For discrete models the distribution over the symbols is returned,
        for continuous models a matrix of the form
        [ [mu_1, sigma_1, weight_1] ... [mu_M, sigma_M, weight_M]  ] is returned.

        """
        raise NotImplementedError

    def setEmission(self, i, distributionParameters):
        """ Set the emission distribution parameters

        Defined in derived classes.
         """
        raise NotImplementedError

    def asMatrices(self):
        "To be defined in derived classes."
        raise NotImplementedError


    def normalize(self):
        """ Normalize transition probs, emission probs (if applicable)
        """
        log.debug( "Normalizing now.")

        i_error = self.cmodel.normalize()
        if i_error == -1:
            log.error("normalization failed")

    def randomize(self, noiseLevel):
        """ to be defined in derived class """
        raise NotImplementedError

    def write(self,fileName):
        """ Writes HMM to file 'fileName'.

        """
        self.cmodel.write_xml(fileName)


    def printtypes(self, model_type):
        strout = []
        if model_type == kNotSpecified:
            return 'kNotSpecified'
        for k in list(types.keys()):
            if model_type & k:
                strout.append(types[k])
        return ' '.join(strout)

    def updateName2id(self):
        """adds all state names to the dictionary name2id"""
        for i in range(self.cmodel.N):
            self.name2id[i] = i
            if(self.cmodel.getStateName(i) != None):
                self.name2id[self.cmodel.getStateName(i)] = i

    def setStateName(self, index, name):
        """sets the state name of state index to name"""
        self.cmodel.setStateName(index, name)
        self.name2id[name] = index

    def getStateName(self, index):
        """returns the name of the state index"""
        return self.cmodel.getStateName(index)
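

# Illustrative sketch (not part of the GHMM API): how an initial distribution
# can be rescaled so that it sums to 1 again after pi[i] has been pinned to a
# new value, which is what HMM.setInitial(..., fixProb=True) aims at. The
# helper name and the plain-list representation are assumptions made for this
# example only; 'pi' is assumed to sum to 1 on entry.
def _sketch_rescale_initial(pi, i, prob):
    rest = 1.0 - pi[i]           # mass currently held by the other states
    if rest <= 0.0:
        raise ValueError("cannot rescale: all mass sits on state " + str(i))
    coeff = (1.0 - prob) / rest  # the other entries must now sum to 1 - prob
    return [prob if j == i else p * coeff for j, p in enumerate(pi)]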


def HMMwriteList(fileName, hmmList, fileType=GHMM_FILETYPE_XML):
    if (fileType == GHMM_FILETYPE_XML):
        if os.path.exists(fileName):
            log.warning( "HMMwriteList: File " + str(fileName) + " already exists. Model will be overwritten.")
        models = ghmmwrapper.cmodel_ptr_array_alloc(len(hmmList))
        for i, model in enumerate(hmmList):
            ghmmwrapper.cmodel_ptr_array_setitem(models, i, model.cmodel)
        ghmmwrapper.ghmm_cmodel_xml_write(models, fileName, len(hmmList))
        ghmmwrapper.free(models)
    elif (fileType==GHMM_FILETYPE_SMO):
        raise WrongFileType("the smo file format is deprecated, use xml instead")
    else:
        raise WrongFileType("unknown file format " + str(fileType))

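
# Illustrative sketch (not part of the GHMM API): the bit-mask arithmetic
# behind HMM.setFlags(), HMM.clearFlags() and HMM.hasFlags(). The flag values
# below are made up for this example; the real constants (kSilentStates,
# kTiedEmissions, ...) are defined elsewhere.
def _sketch_flag_ops():
    FLAG_A, FLAG_B = 1 << 2, 1 << 3   # hypothetical flag bits
    model_type = 0
    model_type |= FLAG_A | FLAG_B     # setFlags: switch bits on
    model_type &= ~FLAG_B             # clearFlags: switch bits off
    # hasFlags returns the masked integer, so it is truthy as soon as ANY of
    # the requested bits is set, not only when all of them are
    return (model_type & FLAG_A,
            model_type & FLAG_B,
            model_type & (FLAG_A | FLAG_B))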

class DiscreteEmissionHMM(HMM):
    """ HMMs with discrete emissions.

    Optional features:
    - silent states
    - higher order states
    - parameter tying in training
    - background probabilities in training
    """

    def __init__(self, emissionDomain, distribution, cmodel):
        HMM.__init__(self, emissionDomain, distribution, cmodel)

        self.model_type = self.cmodel.model_type  # model type
        self.maxorder = self.cmodel.maxorder
        self.background = None

    def __str__(self):
        hmm = self.cmodel
        strout = [str(self.__class__.__name__)]
        if self.cmodel.name:
            strout.append( " " + str(self.cmodel.name))
        strout.append(  "(N="+ str(hmm.N))
        strout.append(  ", M="+ str(hmm.M)+')\n')

        f = lambda x: "%.2f" % (x,) # float rounding function

        if self.hasFlags(kHigherOrderEmissions):
            order = ghmmwrapper.int_array2list(self.cmodel.order, self.N)
        else:
            order = [0]*hmm.N

        if hmm.N <= 4:
            iter_list = list(range(self.N))
        else:
            iter_list = [0,1,'X',hmm.N-2,hmm.N-1]

        for k in iter_list:
            if k == 'X':
                strout.append('\n  ...\n\n')
                continue

            state = hmm.getState(k)
            strout.append( "  state "+ str(k) +' (')
            if order[k] > 0:
                strout.append( 'order='+ str(order[k])+',')


            strout.append( "initial=" + f(state.pi)+')\n')
            strout.append( "    Emissions: ")
            for outp in range(hmm.M**(order[k]+1)):
                strout.append(f(ghmmwrapper.double_array_getitem(state.b,outp)))
                if outp < hmm.M**(order[k]+1)-1:
                    strout.append( ', ')
                else:
                    strout.append('\n')

            strout.append( "    Transitions:")
            #trans = [0.0] * hmm.N
            for i in range( state.out_states):
                strout.append( " ->" + str( state.getOutState(i))+' ('+ f(ghmmwrapper.double_array_getitem(state.out_a,i) ) +')' )
                if i < state.out_states-1:
                    strout.append( ',')
                #strout.append(" with probability " + str(ghmmwrapper.double_array_getitem(state.out_a,i)))

            strout.append('\n')

        return ''.join(strout)



    def verboseStr(self):
        hmm = self.cmodel
        strout = ["\nGHMM Model\n"]
        strout.append( "Name: " + str(self.cmodel.name))
        strout.append( "\nModelflags: "+ self.printtypes(self.cmodel.model_type))
        strout.append(  "\nNumber of states: "+ str(hmm.N))
        strout.append(  "\nSize of Alphabet: "+ str(hmm.M))
        if self.hasFlags(kHigherOrderEmissions):
            order = ghmmwrapper.int_array2list(self.cmodel.order, self.N)
        else:
            order = [0]*hmm.N

        for k in range(hmm.N):
            state = hmm.getState(k)
            strout.append( "\n\nState number "+ str(k) +":")
            if(state.desc is not None):
                strout.append("\nState Name: " + state.desc)
            strout.append( "\nState order: " + str(order[k]))
            strout.append( "\nInitial probability: " + str(state.pi))
            #strout.append("\nsilent state: " + str(self.cmodel.silent[k]))
            strout.append( "\nOutput probabilities: ")
            for outp in range(hmm.M**(order[k]+1)):
                strout.append(str(ghmmwrapper.double_array_getitem(state.b,outp)))
                if outp % hmm.M == hmm.M-1:
                    strout.append( "\n")
                else:
                    strout.append( ", ")

            strout.append( "\nOutgoing transitions:")
            for i in range( state.out_states):
                strout.append( "\ntransition to state " + str( state.getOutState(i)))
                strout.append(" with probability " + str(ghmmwrapper.double_array_getitem(state.out_a,i)))
            strout.append( "\nIngoing transitions:")
            for i in range(state.in_states):
                strout.append( "\ntransition from state " + str( state.getInState(i)))
                strout.append( " with probability " + str(ghmmwrapper.double_array_getitem(state.in_a,i)))
            strout.append( "\nint fix:" + str(state.fix) + "\n")

        if self.hasFlags(kSilentStates):
            strout.append("\nSilent states: \n")
            for k in range(hmm.N):
                strout.append( str(self.cmodel.getSilent(k)) + ", ")
        strout.append( "\n")
        return ''.join(strout)



    def extendDurations(self, durationlist):
        """ Extend states with durations larger than one.

        @param durationlist list with one duration per state
        @note this is done by explicit state copying in C
        """

        for i in range(len(durationlist)):
            if durationlist[i] > 1:
                error = self.cmodel.duration_apply(i, durationlist[i])
                if error:
                    log.error( "durations not applied")
                else:
                    self.N = self.cmodel.N

    def getEmission(self, i):
        i = self.state(i)
        state = self.cmodel.getState(i)
        if self.hasFlags(kHigherOrderEmissions):
            order = ghmmwrapper.int_array_getitem(self.cmodel.order, i)
            emissions = ghmmwrapper.double_array2list(state.b, self.M**(order+1))
        else:
            emissions = ghmmwrapper.double_array2list(state.b, self.M)
        return emissions

    def setEmission(self, i, distributionParameters):
        """ Set the emission distribution parameters for a discrete model."""
        i = self.state(i)
        if not len(distributionParameters) == self.M:
            raise TypeError("Argument 'distributionParameters' does not match the alphabet size.")

        state = self.cmodel.getState(i)

        # updating silent flag and/or model type if necessary
        if self.hasFlags(kSilentStates):
            if sum(distributionParameters) == 0.0:
                self.cmodel.setSilent(i, 1)
            else:
                self.cmodel.setSilent(i, 0)
                #change model_type and free array if no silent state is left
                if 0 == sum(ghmmwrapper.int_array2list(self.cmodel.silent,self.N)):
                    self.clearFlags(kSilentStates)
                    ghmmwrapper.free(self.cmodel.silent)
                    self.cmodel.silent = None
        # if the state becomes the first silent state, allocate memory and set the silent flag
        elif sum(distributionParameters) == 0.0:
            self.setFlags(kSilentStates)
            slist = [0]*self.N
            slist[i] = 1
            self.cmodel.silent = ghmmwrapper.list2int_array(slist)

        #set the emission probabilities
        ghmmwrapper.free(state.b)
        state.b = ghmmwrapper.list2double_array(distributionParameters)


    # XXX Change name?
    def backwardTermination(self, emissionSequence, pybeta, scalingVector):
        """
        Result: the backward log probability of emissionSequence
        """
        seq = emissionSequence.cseq.getSequence(0)

        # parsing 'scalingVector' to C double array.
        cscale = ghmmwrapper.list2double_array(scalingVector)

        # allocating beta matrix
        t = len(emissionSequence)
        cbeta = ghmmhelper.list2double_matrix(pybeta)
        #print cbeta[0]

        error, logp = self.cmodel.backward_termination(seq, t, cbeta[0], cscale)
        if error == -1:
            log.error("backward finished with -1: EmissionSequence cannot be built.")

        # deallocation
        ghmmwrapper.free(cscale)
        ghmmwrapper.double_matrix_free(cbeta[0],t)
        return logp

    def baumWelch(self, trainingSequences, nrSteps=ghmmwrapper.MAX_ITER_BW, loglikelihoodCutoff=ghmmwrapper.EPS_ITER_BW):
        """ Reestimates the model with the sequence in 'trainingSequences'.

        @note that training for models including silent states is not yet
        supported.

        @param trainingSequences EmissionSequence or SequenceSet object
        @param nrSteps the maximal number of BW-steps
        @param loglikelihoodCutoff the least relative improvement in likelihood
        with respect to the last iteration required to continue.

        """
        if not isinstance(trainingSequences,EmissionSequence) and not isinstance(trainingSequences,SequenceSet):
            raise TypeError("EmissionSequence or SequenceSet required, got " + str(trainingSequences.__class__.__name__))

        if self.hasFlags(kSilentStates):
            raise NotImplementedError("Sorry, training of models containing silent states not yet supported.")

        self.cmodel.baum_welch_nstep(trainingSequences.cseq, nrSteps, loglikelihoodCutoff)

    def fbGibbs(self, trainingSequences,  pA, pB, pPi, burnIn = 100, seed = 0):
        """Reestimates the model and returns a sampled state sequence

        @note uses gsl, silent states not supported

        @param seed int for random seed, 0 default
        @param trainingSequences EmissionSequence or SequenceSet
        @param pA prior count for transitions
        @param pB prior count for emissions
        @param pPi prior count for initial state
        @param burnIn number of iterations
        @return set of sampled paths for each training sequence
        @warning work in progress
        """
        if not isinstance(trainingSequences,EmissionSequence) and not isinstance(trainingSequences,SequenceSet):
            raise TypeError("EmissionSequence or SequenceSet required, got " + str(trainingSequences.__class__.__name__))       
        if self.hasFlags(kSilentStates):
            raise NotImplementedError("Sorry, training of models containing silent states not yet supported.")
        A, i = ghmmhelper.list2double_matrix(pA)
        if self.hasFlags(kHigherOrderEmissions):
            B=ghmmwrapper.double_matrix_alloc_row(len(pB))
            for i in range(len(pB)):
                ghmmwrapper.double_matrix_set_col(B, i, ghmmwrapper.list2double_array(pB[i]))
        else:
            B, j = ghmmhelper.list2double_matrix(pB)
        Pi = ghmmwrapper.list2double_array(pPi)

        return ghmmhelper.int_matrix2list(self.cmodel.fbgibbs(trainingSequences.cseq, A, B, Pi, burnIn,seed), trainingSequences.cseq.seq_number, len(trainingSequences))

    def cfbGibbs(self,trainingSequences, pA, pB, pPi,  R=-1, burnIn = 100, seed = 0):
        """Reestimates the model and returns a sampled state sequence

        @note uses gsl, silent states not supported

        @param seed int for random seed, 0 default
        @param trainingSequences EmissionSequence or SequenceSet
        @param pA prior count for transitions
        @param pB prior count for emissions
        @param pPi prior count for initial state
        @param R length of uniform compression >0, works best for .5log(sqrt(T)) where T is length of seq
        (the default -1 selects this value; values <= 1 are replaced by 2)
        @param burnIn number of iterations
        @return set of sampled paths for each training sequence
        @warning work in progress
        """
        if not isinstance(trainingSequences,EmissionSequence) and not isinstance(trainingSequences,SequenceSet):
            raise TypeError("EmissionSequence or SequenceSet required, got " + str(trainingSequences.__class__.__name__))

        if self.hasFlags(kSilentStates):
            raise NotImplementedError("Sorry, training of models containing silent states not yet supported.")
        if R == -1:
            R = int(math.ceil(.5*math.log(math.sqrt(len(trainingSequences)))))
            #print R
        if R <= 1: 
            R = 2
        A, i = ghmmhelper.list2double_matrix(pA)
        if self.hasFlags(kHigherOrderEmissions):
            B=ghmmwrapper.double_matrix_alloc_row(len(pB))
            for i in range(len(pB)):
                ghmmwrapper.double_matrix_set_col(B, i, ghmmwrapper.list2double_array(pB[i]))
        else:
            B, j = ghmmhelper.list2double_matrix(pB)
        Pi = ghmmwrapper.list2double_array(pPi)
        return ghmmhelper.int_matrix2list(self.cmodel.cfbgibbs(trainingSequences.cseq, A, B, Pi, R, burnIn, seed), trainingSequences.cseq.seq_number, len(trainingSequences))

    def applyBackgrounds(self, backgroundWeight):
        """
        Apply the background distribution to the emission probabilities of states
        which have been assigned one (usually in the editor and coded in the XML).

        applyBackgrounds computes a convex combination of the emission probability
        and the background.

        @param backgroundWeight (within [0,1]) controls the background's
        contribution for each state.
        """
        if not len(backgroundWeight) == self.N:
            raise TypeError("Argument 'backgroundWeight' does not match number of states.")

        cweights = ghmmwrapper.list2double_array(backgroundWeight)
        result = self.cmodel.background_apply(cweights)

        ghmmwrapper.free(cweights)
        if result:
            log.error("applyBackgrounds failed.")


    def setBackgrounds(self, backgroundObject, stateBackground):
        """
        Configure model to use the background distributions in 'backgroundObject'.

        @param backgroundObject BackgroundDistribution
        @param stateBackground a list of indices (one for each state) referring
        to distributions in 'backgroundObject'.

        @note values in backgroundObject are deep copied into the model
        """

        if not isinstance(backgroundObject,BackgroundDistribution):
            raise TypeError("BackgroundDistribution required, got " + str(backgroundObject.__class__.__name__))

        if not type(stateBackground) == list:
            raise TypeError("list required got "+ str(type(stateBackground)))

        if not len(stateBackground) == self.N:
            raise TypeError("Argument 'stateBackground' does not match number of states.")

        if self.background != None:
            del(self.background)
            ghmmwrapper.free(self.cmodel.background_id)
        self.background = backgroundObject.getCopy()
        self.cmodel.bp = self.background.cbackground
        self.cmodel.background_id = ghmmwrapper.list2int_array(stateBackground)

        # updating model type
        self.setFlags(kBackgroundDistributions)

    def setBackgroundAssignments(self, stateBackground):
        """ Change all the assignments of background distributions to states.

        Input is a list of background ids or '-1' for no background, or list of background names
        """
        if not type(stateBackground) == list:
            raise TypeError("list required got "+ str(type(stateBackground)))

        assert self.cmodel.background_id is not None, "Error: No backgrounds defined in model."
        assert len(stateBackground) == self.N, "Error: Number of background assignments does not match number of states."
        # check for valid background id ('-1' means no background)
        for d in stateBackground:
            if type(d) == str:
                assert d in self.background.name2id, "Error:  Invalid background distribution name."
                d = self.background.name2id[d]
            assert d == -1 or d in range(self.background.cbackground.n), "Error: Invalid background distribution id."

        for i, b_id in enumerate(stateBackground):
            if type(b_id) == str:
                b_id = self.background.name2id[b_id]
            ghmmwrapper.int_array_setitem(self.cmodel.background_id, i, b_id)


    def getBackgroundAssignments(self):
        """ Get the background assignments of all states

        '-1' -> no background
        """
        if self.hasFlags(kBackgroundDistributions):
            return ghmmwrapper.int_array2list(self.cmodel.background_id, self.N)


    def updateTiedEmissions(self):
        """ Averages emission probabilities of tied states. """
        assert self.hasFlags(kTiedEmissions) and self.cmodel.tied_to is not None, "cmodel.tied_to is undefined."
        self.cmodel.update_tie_groups()


    def setTieGroups(self, tieList):
        """ Sets the tied emission groups

        @param tieList contains for every state either '-1' or the index
        of the tied emission group leader.

        @note The tied emission group leader is tied to itself
        """
        if len(tieList) != self.N:
            raise IndexError("Number of entries in tieList is different from number of states.")

        if self.cmodel.tied_to is None:
            log.debug( "allocating tied_to")
            self.cmodel.tied_to = ghmmwrapper.list2int_array(tieList)
            self.setFlags(kTiedEmissions)
        else:
            log.debug( "tied_to already initialized")
            for i in range(self.N):
                ghmmwrapper.int_array_setitem(self.cmodel.tied_to,i,tieList[i])


    def removeTieGroups(self):
        """ Removes all tied emission information. """
        if self.hasFlags(kTiedEmissions) and self.cmodel.tied_to != None:
            ghmmwrapper.free(self.cmodel.tied_to)
            self.cmodel.tied_to = None
            self.clearFlags(kTiedEmissions)

    def getTieGroups(self):
        """ Gets tied emission group structure. """
        if not self.hasFlags(kTiedEmissions) or self.cmodel.tied_to is None:
            raise TypeError("HMM has no tied emissions or self.cmodel.tied_to is undefined.")

        return ghmmwrapper.int_array2list(self.cmodel.tied_to, self.N)


    def getSilentFlag(self,state):
        state = self.state(state)
        if self.hasFlags(kSilentStates):
            return self.cmodel.getSilent(state)
        else:
            return 0

    def asMatrices(self):
        "Return the parameters in matrix form."
        A = []
        B = []
        pi = []
        if self.hasFlags(kHigherOrderEmissions):
            order = ghmmwrapper.int_array2list(self.cmodel.order, self.N)
        else:
            order = [0]*self.N

        for i in range(self.cmodel.N):
            A.append([0.0] * self.N)
            state = self.cmodel.getState(i)
            pi.append(state.pi)
            B.append(ghmmwrapper.double_array2list(state.b,self.M ** (order[i]+1)))
            for j in range(state.out_states):
                state_index = ghmmwrapper.int_array_getitem(state.out_id, j)
                A[i][state_index] = ghmmwrapper.double_array_getitem(state.out_a,j)

        return [A,B,pi]


    def isSilent(self,state):
        """
        @returns True if 'state' is silent, False otherwise
        """
        state = self.state(state)
        if not 0 <= state <= self.N-1:
            raise IndexError("Invalid state index")

        if self.hasFlags(kSilentStates) and self.cmodel.getSilent(state):
            return True
        else:
            return False

    def write(self,fileName):
        """
        Writes HMM to file 'fileName'.
        """
        if self.cmodel.alphabet is None:
            self.cmodel.alphabet = self.emissionDomain.toCstruct()

        self.cmodel.write_xml(fileName)
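

# Illustrative sketch (not part of the GHMM API): DiscreteEmissionHMM.asMatrices()
# expands the sparse per-state transition storage (successor ids plus matching
# probabilities) into a dense N x N matrix, and a state of order k carries
# M**(k+1) emission parameters. Plain Python lists stand in for the C arrays;
# both helper names are assumptions made for this example only.
def _sketch_dense_transitions(out_ids, out_probs):
    N = len(out_ids)
    A = [[0.0] * N for _ in range(N)]
    for i in range(N):
        for j, p in zip(out_ids[i], out_probs[i]):
            A[i][j] = p               # transitions that are not stored stay 0.0
    return A


def _sketch_emission_table_size(M, order):
    # M**order possible histories, each with a distribution over M symbols
    return M ** (order + 1)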




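# Illustrative sketch (not part of the GHMM API): the convex combination
# described in DiscreteEmissionHMM.applyBackgrounds(). It assumes that the
# weight w scales the background and (1 - w) the original emissions; the
# actual computation is done by the C library and may differ in detail.
def _sketch_mix_background(emission, background, w):
    if not 0.0 <= w <= 1.0:
        raise ValueError("weight must lie in [0,1]")
    return [(1.0 - w) * e + w * b for e, b in zip(emission, background)]

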
######################################################
class StateLabelHMM(DiscreteEmissionHMM):
    """ Labelled HMMs with discrete emissions.

        Same feature list as in DiscreteEmissionHMM models.
    """
    def __init__(self, emissionDomain, distribution, labelDomain, cmodel):
        DiscreteEmissionHMM.__init__(self, emissionDomain, distribution, cmodel)

        if not isinstance(labelDomain, LabelDomain):
            raise TypeError("Invalid labelDomain")

        self.labelDomain = labelDomain


    def __str__(self):
        hmm = self.cmodel
        strout = [str(self.__class__.__name__)]
        if self.cmodel.name:
            strout.append( " " + str(self.cmodel.name))
        strout.append(  "(N= "+ str(hmm.N))
        strout.append(  ", M= "+ str(hmm.M)+')\n')

        f = lambda x: "%.2f" % (x,) # float rounding function

        if self.hasFlags(kHigherOrderEmissions):
            order = ghmmwrapper.int_array2list(self.cmodel.order, self.N)
        else:
            order = [0]*hmm.N
        label = ghmmwrapper.int_array2list(hmm.label, self.N)

        if hmm.N <= 4:
            iter_list = list(range(self.N))
        else:
            iter_list = [0,1,'X',hmm.N-2,hmm.N-1]

        for k in iter_list:
            if k == 'X':
                strout.append('\n  ...\n\n')
                continue

            state = hmm.getState(k)
            strout.append( "  state "+ str(k) +' (')
            if order[k] > 0:
                strout.append( 'order= '+ str(order[k])+',')

            strout.append( "initial= " + f(state.pi)+', label= ' + str(self.labelDomain.external(label[k])) + ')\n')
            strout.append( "    Emissions: ")
            for outp in range(hmm.M**(order[k]+1)):
                strout.append(f(ghmmwrapper.double_array_getitem(state.b,outp)))
                if outp < hmm.M**(order[k]+1)-1:
                    strout.append( ', ')
                else:
                    strout.append('\n')

            strout.append( "    Transitions:")
            #trans = [0.0] * hmm.N
            for i in range( state.out_states):
                strout.append( " ->" + str( state.getOutState(i))+' ('+ f(ghmmwrapper.double_array_getitem(state.out_a,i) ) +')' )
                if i < state.out_states-1:
                    strout.append( ',')
                #strout.append(" with probability " + str(ghmmwrapper.double_array_getitem(state.out_a,i)))

            strout.append('\n')

        return ''.join(strout)


    def verboseStr(self):
        hmm = self.cmodel
        strout = ["\nGHMM Model\n"]
        strout.append("Name: " + str(self.cmodel.name))
        strout.append("\nModelflags: "+ self.printtypes(self.cmodel.model_type))
        strout.append("\nNumber of states: "+ str(hmm.N))
        strout.append("\nSize of Alphabet: "+ str(hmm.M))

        if hmm.model_type & kHigherOrderEmissions:
            order = ghmmwrapper.int_array2list(hmm.order, self.N)
        else:
            order = [0]*hmm.N
        label = ghmmwrapper.int_array2list(hmm.label, self.N)
        for k in range(hmm.N):
            state = hmm.getState(k)
            strout.append("\n\nState number "+ str(k) +":")
            if(state.desc is not None):
                strout.append("\nState Name: " + state.desc)
            strout.append("\nState label: "+str(self.labelDomain.external(label[k])))

            strout.append("\nState order: " + str(order[k]))
            strout.append("\nInitial probability: " + str(state.pi))
            strout.append("\nOutput probabilities:\n")
            for outp in range(hmm.M**(order[k]+1)):
                strout.append(str(ghmmwrapper.double_array_getitem(state.b,outp)))
                if outp % hmm.M == hmm.M-1:
                    strout.append("\n")
                else:
                    strout.append(", ")

            strout.append("Outgoing transitions:")
            for i in range( state.out_states):
                strout.append("\ntransition to state " + str(state.getOutState(i)) + " with probability " + str(state.getOutProb(i)))
            strout.append( "\nIngoing transitions:")
            for i in range(state.in_states):
                strout.append( "\ntransition from state " + str(state.getInState(i)) + " with probability " + str(state.getInProb(i)))
            strout.append("\nint fix:" + str(state.fix) + "\n")

        if hmm.model_type & kSilentStates:
            strout.append("\nSilent states: \n")
            for k in range(hmm.N):
                strout.append(str(hmm.getSilent(k)) + ", ")
            strout.append("\n")

        return ''.join(strout)

    def setLabels(self, labelList):
        """  Set the state labels to the values given in labelList.

        LabelList is in external representation.
        """

        assert len(labelList) == self.N, "Invalid number of labels."

        # check that every label is admissible before converting to internal indices
        for i in range(self.N):
            if not self.labelDomain.isAdmissable(labelList[i]):
                raise GHMMOutOfDomain("Label "+str(labelList[i])+" not included in labelDomain.")

        ghmmwrapper.free(self.cmodel.label)
        self.cmodel.label = ghmmwrapper.list2int_array([self.labelDomain.internal(l) for l in labelList])

    def getLabels(self):
        labels = ghmmwrapper.int_array2list(self.cmodel.label, self.N)
        return [self.labelDomain.external(l) for l in labels]

    def getLabel(self,stateIndex):
        """
        @returns label of the state 'stateIndex'.
        """
        return self.cmodel.getLabel(stateIndex)

    def externalLabel(self, internal):
        """
        @returns label representation of an int or list of ints
        """

        if type(internal) is int:
            return self.labelDomain.external(internal) # return Label
        elif type(internal) is list:
            return self.labelDomain.externalSequence(internal)
        else:
            raise TypeError('int or list needed')

    def internalLabel(self, external):
        """
        @returns int representation of a label or list of labels
        """

        if type(external) is list:
            return self.labelDomain.internalSequence(external)
        else:
            return self.labelDomain.internal(external)

    def sampleSingle(self, seqLength, seed = 0):
        seqPtr = self.cmodel.label_generate_sequences(seed, seqLength, 1, seqLength)
        return EmissionSequence(self.emissionDomain, seqPtr, labelDomain = self.labelDomain )

    def sample(self, seqNr,seqLength, seed = 0):
        seqPtr = self.cmodel.label_generate_sequences(seed, seqLength, seqNr, seqLength)
        return SequenceSet(self.emissionDomain,seqPtr, labelDomain = self.labelDomain)


    def labeledViterbi(self, emissionSequences):
        """
        @returns the labeling of the input sequence(s) as given by the Viterbi
        path, together with the corresponding log probability (or probabilities).

        For one EmissionSequence a list of labels is returned; for a SequenceSet
        a list of lists of labels.

        """
        emissionSequences = emissionSequences.asSequenceSet()
        seqNumber = len(emissionSequences)

        if not emissionSequences.emissionDomain == self.emissionDomain:
            raise TypeError("Sequence and model emissionDomains are incompatible.")

        vPath, log_p = self.viterbi(emissionSequences)

        f = lambda i: self.labelDomain.external(self.getLabel(i))
        if seqNumber == 1:
            labels = list(map(f, vPath))
        else:
            labels = [list(map(f, vp)) for vp in vPath]

        return (labels, log_p)


    def kbest(self, emissionSequences, k = 1):
        """ Compute the k-best labeling for each sequence in emissionSequences

        @param emissionSequences can either be a SequenceSet or an
        EmissionSequence
        @param k the number of labelings to produce

        Result: [l_0, ..., l_T] the labeling of emissionSequences if it is an
        EmissionSequence object,
        [[l_0^0, ..., l_T^0], ..., [l_0^j, ..., l_T^j]] for a j-sequence
        SequenceSet
        """
        if self.hasFlags(kSilentStates):
            raise NotImplementedError("Sorry, k-best decoding on models containing silent states not yet supported.")

        emissionSequences = emissionSequences.asSequenceSet()
        seqNumber = len(emissionSequences)

        allLogs = []
        allLabels = []

        for i in range(seqNumber):
            seq = emissionSequences.cseq.getSequence(i)
            seq_len = emissionSequences.cseq.getLength(i)

            labeling, log_p = self.cmodel.label_kbest(seq, seq_len, k)
            oneLabel = ghmmwrapper.int_array2list(labeling, seq_len)

            allLabels.append(oneLabel)
            allLogs.append(log_p)
            ghmmwrapper.free(labeling)

        if emissionSequences.cseq.seq_number > 1:
            return (list(map(self.externalLabel, allLabels)), allLogs)
        else:
            return (self.externalLabel(allLabels[0]), allLogs[0])


    def gradientSearch(self, emissionSequences, eta=.1, steps=20):
        """ Trains a model with the given sequences using a gradient descent
        algorithm.

        @param emissionSequences can either be a SequenceSet or an
        EmissionSequence
        @param eta algorithm terminates if the descent is smaller than eta
        @param steps number of iterations

        @returns True if the model was replaced by the trained model, False
        if the gradient descent failed
        """

        # check for labels
        if not self.hasFlags(kLabeledStates):
            raise NotImplementedError("Error: Model has no labeled states.")

        emissionSequences = emissionSequences.asSequenceSet()
        seqNumber = len(emissionSequences)

        tmp_model = self.cmodel.label_gradient_descent(emissionSequences.cseq, eta, steps)
        if tmp_model is None:
            log.error("Gradient descent did not finish successfully.")
            return False
        else:
            self.cmodel = tmp_model
            return True

    def labeledlogikelihoods(self, emissionSequences):
        """ Compute a vector ( log( P[s,l| model]) )_{s} of log-likelihoods of the
        individual emissionSequences using the forward algorithm

        @param emissionSequences SequenceSet

        Result: log( P[emissionSequences,labels| model]) as a list of floats,
        one entry per sequence
        """
        emissionSequences = emissionSequences.asSequenceSet()
        seqNumber = len(emissionSequences)

        if emissionSequences.cseq.state_labels is None:
            raise TypeError("Sequence needs to be labeled.")

        likelihoodList = []

        for i in range(seqNumber):
            seq = emissionSequences.cseq.getSequence(i)
            labels = ghmmwrapper.int_matrix_get_col(emissionSequences.cseq.state_labels,i)
            tmp = emissionSequences.cseq.getLength(i)
            ret_val,likelihood = self.cmodel.label_logp(seq, labels, tmp)

            if ret_val == -1:
                log.warning("forward returned -1: Sequence " + str(i) + " cannot be built.")
                likelihoodList.append(-float('Inf'))
            else:
                likelihoodList.append(likelihood)

        return likelihoodList

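    def labeledForward(self, emissionSequence, labelSequence):
        """ Run the forward algorithm for a sequence and its labeling.

        @param emissionSequence an EmissionSequence
        @param labelSequence list of labels, one per emission

        Result: (logp, alpha, scale): the log-probability, the (T x N)-matrix
        containing the forward-variables and the scaling vector
        """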
        if not isinstance(emissionSequence,EmissionSequence):
            raise TypeError("EmissionSequence required, got " + str(emissionSequence.__class__.__name__))

        n_states = self.cmodel.N

        t = emissionSequence.cseq.getLength(0)
        if t != len(labelSequence):
            raise TypeError("emissionSequence and labelSequence must have same length")

        calpha = ghmmwrapper.double_matrix_alloc(t, n_states)
        cscale = ghmmwrapper.double_array_alloc(t)

        seq = emissionSequence.cseq.getSequence(0)
        label = ghmmwrapper.list2int_array(self.internalLabel(labelSequence))

        error, logp = self.cmodel.label_forward(seq, label, t, calpha, cscale)
        if error == -1:
            log.error( "Forward finished with -1: Sequence cannot be built.")

        # translate alpha / scale to python lists
        pyscale = ghmmwrapper.double_array2list(cscale, t)
        pyalpha = ghmmhelper.double_matrix2list(calpha,t,n_states)

        ghmmwrapper.free(label)
        ghmmwrapper.free(cscale)
        ghmmwrapper.double_matrix_free(calpha,t)
        return (logp, pyalpha, pyscale)

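    def labeledBackward(self, emissionSequence, labelSequence, scalingVector):
        """ Run the backward algorithm for a sequence and its labeling.

        @param emissionSequence an EmissionSequence
        @param labelSequence list of labels, one per emission
        @param scalingVector the scaling vector returned by labeledForward

        Result: (logp, beta): the log-probability and the (T x N)-matrix
        containing the backward-variables
        """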
        if not isinstance(emissionSequence,EmissionSequence):
            raise TypeError("EmissionSequence required, got " + str(emissionSequence.__class__.__name__))

        t = emissionSequence.cseq.getLength(0)
        if t != len(labelSequence):
            raise TypeError("emissionSequence and labelSequence must have same length")

        seq = emissionSequence.cseq.getSequence(0)
        label = ghmmwrapper.list2int_array(self.internalLabel(labelSequence))

        # parsing 'scalingVector' to C double array.
        cscale = ghmmwrapper.list2double_array(scalingVector)

        # allocating beta matrix
        cbeta = ghmmwrapper.double_matrix_alloc(t, self.cmodel.N)

        error,logp = self.cmodel.label_backward(seq, label, t, cbeta, cscale)
        if error == -1:
            log.error( "backward finished with -1: EmissionSequence cannot be built.")

        pybeta = ghmmhelper.double_matrix2list(cbeta,t,self.cmodel.N)

        # deallocation
        ghmmwrapper.free(cscale)
        ghmmwrapper.free(label)
        ghmmwrapper.double_matrix_free(cbeta,t)
        return (logp, pybeta)

    def labeledBaumWelch(self, trainingSequences, nrSteps=ghmmwrapper.MAX_ITER_BW,
                         loglikelihoodCutoff=ghmmwrapper.EPS_ITER_BW):
        """ Reestimates the model with the sequence in 'trainingSequences'.

        @note that training for models including silent states is not yet
        supported.

        @param trainingSequences EmissionSequence or SequenceSet object
        @param nrSteps the maximal number of BW-steps
        @param loglikelihoodCutoff the least relative improvement in likelihood
        with respect to the last iteration required to continue.

        """
        if not isinstance(trainingSequences,EmissionSequence) and not isinstance(trainingSequences,SequenceSet):
            raise TypeError("EmissionSequence or SequenceSet required, got " + str(trainingSequences.__class__.__name__))

        if self.hasFlags(kSilentStates):
            raise NotImplementedError("Sorry, training of models containing silent states not yet supported.")

        self.cmodel.label_baum_welch_nstep(trainingSequences.cseq, nrSteps, loglikelihoodCutoff)


    def write(self,fileName):
        """ Writes HMM to file 'fileName'.

        """
        if self.cmodel.alphabet is None:
            self.cmodel.alphabet = self.emissionDomain.toCstruct()

        if self.cmodel.label_alphabet is None:
            self.cmodel.label_alphabet = self.labelDomain.toCstruct()

        self.cmodel.write_xml(fileName)
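

# Illustrative sketch, not part of the GHMM API. The label() method above maps
# every state of a Viterbi path to that state's label. The helper below shows
# the same lookup on plain Python lists. Its name and both lookup tables
# (state_labels, label_names) are hypothetical stand-ins for getLabel() and
# labelDomain.external().
def _example_path_to_labels(path, state_labels, label_names):
    """Translate a list of state indices into a list of external labels."""
    # state index -> internal label index -> external label
    return [label_names[state_labels[state]] for state in path]

# Usage: with three states labeled (0, 1, 1) and label names ['exon', 'intron'],
# _example_path_to_labels([0, 2, 1], [0, 1, 1], ['exon', 'intron'])
# yields ['exon', 'intron', 'intron'].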



class GaussianEmissionHMM(HMM):
    """ HMMs with Gaussian distribution as emissions.

    """

    def __init__(self, emissionDomain, distribution, cmodel):
        HMM.__init__(self, emissionDomain, distribution, cmodel)

        # Baum Welch context, call baumWelchSetup to initialize
        self.BWcontext = None

    def getTransition(self, i, j):
        """ @returns the probability of the transition from state i to state j.
        Returns 0.0 if there is no transition from i to j.
        """
        i = self.state(i)
        j = self.state(j)

        transition = self.cmodel.get_transition(i, j, 0)
        if transition < 0.0: # Tried to access non-existing edge:
            transition = 0.0
        return transition

    def setTransition(self, i, j, prob):
        """ Accessor function for the transition a_ij """

        i = self.state(i)
        j = self.state(j)

        if not self.cmodel.check_transition(i, j, 0):
            raise ValueError("No transition between state " + str(i) + " and " + str(j))

        self.cmodel.set_transition(i, j, 0, float(prob))

    def getEmission(self, i):
        """ @returns (mu, sigma^2)  """
        i = self.state(i)
        if not 0 <= i < self.N:
            raise IndexError("Index " + str(i) + " out of bounds.")

        state = self.cmodel.getState(i)
        mu    = state.getMean(0)
        sigma = state.getStdDev(0)
        return (mu, sigma)

    def setEmission(self, i, values):
        """ Set the emission distribution parameters for state i

        @param i index of a state
        @param values tuple of mu, sigma
        """
        mu, sigma = values
        i = self.state(i)

        state = self.cmodel.getState(i)
        state.setMean(0, float(mu))
        state.setStdDev(0, float(sigma))

    def getEmissionProbability(self, value, i):
        """ @returns probability of emitting value in state i  """
        i = self.state(i)

        # value can be float or vector of floats
        try:
            assert len(value) == self.cmodel.dim
        except (TypeError):
            assert 1 == self.cmodel.dim
            v = [float(value)]
        else:
            v = value

        state = self.cmodel.getState(i)
        valueptr = ghmmwrapper.list2double_array(v)
        p = state.calc_b(valueptr)
        ghmmwrapper.free(valueptr)
        return p

    
    def __str__(self):
        hmm = self.cmodel
        strout = [str(self.__class__.__name__)]
        if self.cmodel.name:
            strout.append( " " + str(self.cmodel.name))
        strout.append(  "(N="+ str(hmm.N)+')\n')

        f = lambda x: "%.2f" % (x,)  # float rounding function

        if hmm.N <= 4:
            iter_list = list(range(self.N))
        else:
            iter_list = [0,1,'X',hmm.N-2,hmm.N-1]

        for k in iter_list:
            if k == 'X':
                strout.append('\n  ...\n\n')
                continue

            state = hmm.getState(k)
            strout.append("  state "+ str(k) + " (")
            strout.append( "initial=" + f(state.pi) )
            if self.cmodel.cos > 1:
                strout.append(', cos='+ str(self.cmodel.cos))
            strout.append(", mu=" + f(state.getMean(0))+', ')
            strout.append("sigma=" + f(state.getStdDev(0)) )
            strout.append(')\n')



            strout.append( "    Transitions: ")
            if self.cmodel.cos > 1:
                strout.append("\n")

            for c in range(self.cmodel.cos):
                if self.cmodel.cos > 1:
                    strout.append('      class: ' + str(c)+ ':'  )
                for i in range( state.out_states):
                    strout.append('->' + str(state.getOutState(i)) + ' (' + f(state.getOutProb(i, c))+')' )
                    if i < state.out_states-1:
                        strout.append( ', ')

                strout.append('\n')

        return ''.join(strout)


    def verboseStr(self):
        hmm = self.cmodel
        strout = ["\nHMM Overview:"]
        strout.append("\nNumber of states: " + str(hmm.N))
        strout.append("\nNumber of mixture components: " + str(hmm.M))

        for k in range(hmm.N):
            state = hmm.getState(k)
            strout.append("\n\nState number "+ str(k) + ":")
            if(state.desc is not None):
                strout.append("\nState Name: " + state.desc)
            strout.append("\nInitial probability: " + str(state.pi) + "\n")

            weight = ""
            mue = ""
            u =  ""

            weight += str(ghmmwrapper.double_array_getitem(state.c,0))
            mue += str(state.getMean(0))
            u += str(state.getStdDev(0))

            strout.append("  mean: " + str(mue) + "\n")
            strout.append("  variance: " + str(u) + "\n")
            strout.append("  fix: " + str(state.fix) + "\n")

            for c in range(self.cmodel.cos):
                strout.append("\n  Class : " + str(c)                )
                strout.append("\n    Outgoing transitions:")
                for i in range( state.out_states):
                    strout.append("\n      transition to state " + str(state.getOutState(i)) + " with probability = " + str(state.getOutProb(i, c)))
                strout.append("\n    Ingoing transitions:")
                for i in range(state.in_states):
                    strout.append("\n      transition from state " + str(state.getInState(i)) +" with probability = "+ str(state.getInProb(i, c)))


        return ''.join(strout)

    def forward(self, emissionSequence):
        """ Compute the scaled forward variables for a single sequence.

        Result: (alpha, scale): the (T x N)-matrix containing the
        forward-variables and the scaling vector
        """
        if not isinstance(emissionSequence,EmissionSequence):
            raise TypeError("EmissionSequence required, got " + str(emissionSequence.__class__.__name__))

        i = self.cmodel.N

        t = emissionSequence.cseq.getLength(0)
        calpha = ghmmwrapper.double_matrix_alloc (t, i)
        cscale = ghmmwrapper.double_array_alloc(t)

        seq = emissionSequence.cseq.getSequence(0)

        error, logp = self.cmodel.forward(seq, t, None, calpha, cscale)
        if error == -1:
            log.error( "Forward finished with -1: Sequence cannot be built.")

        # translate alpha / scale to python lists
        pyscale = ghmmwrapper.double_array2list(cscale, t) # XXX return Python2.5 arrays???
        pyalpha = ghmmhelper.double_matrix2list(calpha,t,i) # XXX return Python2.5 arrays? Also
        # XXX Check Matrix-valued input.

        ghmmwrapper.free(cscale)
        ghmmwrapper.double_matrix_free(calpha,t)
        return (pyalpha,pyscale)

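    def backward(self, emissionSequence, scalingVector):
        """ Compute the scaled backward variables for a single sequence.

        @param scalingVector the scaling vector returned by forward()

        Result: the (T x N)-matrix containing the backward-variables
        """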
        if not isinstance(emissionSequence,EmissionSequence):
            raise TypeError("EmissionSequence required, got " + str(emissionSequence.__class__.__name__))

        seq = emissionSequence.cseq.getSequence(0)

        # parsing 'scalingVector' to C double array.
        cscale = ghmmwrapper.list2double_array(scalingVector)

        # allocating beta matrix
        t = emissionSequence.cseq.getLength(0)
        cbeta = ghmmwrapper.double_matrix_alloc(t, self.cmodel.N)

        error = self.cmodel.backward(seq,t,None,cbeta,cscale)
        if error == -1:
            log.error( "backward finished with -1: EmissionSequence cannot be built.")


        pybeta = ghmmhelper.double_matrix2list(cbeta,t,self.cmodel.N)

        # deallocation
        ghmmwrapper.free(cscale)
        ghmmwrapper.double_matrix_free(cbeta,t)
        return pybeta

    def loglikelihoods(self, emissionSequences):
        """ Compute a vector ( log( P[s| model]) )_{s} of log-likelihoods of the
        individual emissionSequences using the forward algorithm.

        @param emissionSequences SequenceSet

        Result: log( P[emissionSequences| model]) as a list of floats,
        one entry per sequence

        """
        emissionSequences = emissionSequences.asSequenceSet()
        seqNumber = len(emissionSequences)

        if self.cmodel.cos > 1:
            log.debug( "self.cmodel.cos = " + str( self.cmodel.cos) )
            assert self.cmodel.class_change is not None, "Error: class_change not initialized."

        likelihoodList = []

        for i in range(seqNumber):
            seq = emissionSequences.cseq.getSequence(i)
            tmp = emissionSequences.cseq.getLength(i)

            if self.cmodel.cos > 1:
                self.cmodel.class_change.k = i

            ret_val, likelihood = self.cmodel.logp(seq, tmp)
            if ret_val == -1:

                log.warning( "forward returned -1: Sequence "+str(i)+" cannot be built.")
                # XXX TODO: Eventually this should trickle down to C-level
                # Returning -DBL_MIN instead of infinity is stupid, since the latter allows
                # to continue further computations with that inf, which causes
                # things to blow up later.
                # cmodel.logp() could do without a return value if -Inf is returned
                # What should be the semantics in case of computing the likelihood of
                # a set of sequences?
                likelihoodList.append(-float('Inf'))
            else:
                likelihoodList.append(likelihood)

        # resetting class_change->k to default
        if self.cmodel.cos > 1:
            self.cmodel.class_change.k = -1

        return likelihoodList


    def viterbi(self, emissionSequences):
        """ Compute the Viterbi-path for each sequence in emissionSequences

        @param emissionSequences can either be a SequenceSet or an
        EmissionSequence

        Result: [q_0, ..., q_T] the viterbi-path of emissionSequences if it is an
        EmissionSequence object,
        [[q_0^0, ..., q_T^0], ..., [q_0^k, ..., q_T^k]] for a k-sequence
        SequenceSet. The path(s) are returned together with their
        log-probability.
        """
        emissionSequences = emissionSequences.asSequenceSet()
        seqNumber = len(emissionSequences)

        if self.cmodel.cos > 1:
            log.debug( "self.cmodel.cos = "+ str( self.cmodel.cos))
            assert self.cmodel.class_change is not None, "Error: class_change not initialized."

        allLogs = []
        allPaths = []
        for i in range(seqNumber):
            if self.cmodel.cos > 1:
                # if emissionSequence is a sequenceSet with multiple sequences,
                # use sequence index as class_change.k
                self.cmodel.class_change.k = i

            seq = emissionSequences.cseq.getSequence(i)
            seq_len = emissionSequences.cseq.getLength(i)

            try:
                viterbiPath, log_p = self.cmodel.viterbi(seq, seq_len)
            except TypeError:
                viterbiPath, log_p = (None, float("-infinity"))

            if viterbiPath is not None:
                # integer division: int_array2list needs an integer length
                onePath = ghmmwrapper.int_array2list(viterbiPath, seq_len // self.cmodel.dim)
                # free every C path after copying it, not only the last one
                ghmmwrapper.free(viterbiPath)
            else:
                onePath = []

            allPaths.append(onePath)
            allLogs.append(log_p)

        # resetting class_change->k to default
        if self.cmodel.cos > 1:
            self.cmodel.class_change.k = -1

        if emissionSequences.cseq.seq_number > 1:
            return (allPaths, allLogs)
        else:
            return (allPaths[0], allLogs[0])

    def baumWelch(self, trainingSequences, nrSteps=ghmmwrapper.MAX_ITER_BW, loglikelihoodCutoff=ghmmwrapper.EPS_ITER_BW):
        """ Reestimate the model parameters given the trainingSequences.

        Perform at most nrSteps iterations, stopping early once the
        improvement in likelihood is below loglikelihoodCutoff.

        @param trainingSequences can either be a SequenceSet or an
        EmissionSequence
        @param nrSteps the maximal number of BW-steps
        @param loglikelihoodCutoff the least relative improvement in likelihood
        with respect to the last iteration required to continue.

        Result: Final loglikelihood
        """

        if not isinstance(trainingSequences, SequenceSet) and not isinstance(trainingSequences, EmissionSequence):
            raise TypeError("baumWelch requires a SequenceSet or EmissionSequence object.")

        if not self.emissionDomain.CDataType == "double":
            raise TypeError("Continuous sequence needed.")

        self.baumWelchSetup(trainingSequences, nrSteps, loglikelihoodCutoff)
        ghmmwrapper.ghmm_cmodel_baum_welch(self.BWcontext)
        likelihood = ghmmwrapper.double_array_getitem(self.BWcontext.logp, 0)
        #(steps_made, loglikelihood_array, scale_array) = self.baumWelchStep(nrSteps,
        #                                                                    loglikelihoodCutoff)
        self.baumWelchDelete()

        return likelihood

    def baumWelchSetup(self, trainingSequences, nrSteps, loglikelihoodCutoff=ghmmwrapper.EPS_ITER_BW):
        """ Setup necessary temporary variables for Baum-Welch-reestimation.

        Use with baumWelchStep for more control over the training, computing
        diagnostics or doing noise-insertion

        @param trainingSequences can either be a SequenceSet or an EmissionSequence
        @param nrSteps the maximal number of BW-steps
        @param loglikelihoodCutoff the least relative improvement in likelihood
        with respect to the last iteration required to continue.
        """
        self.BWcontext = ghmmwrapper.ghmm_cmodel_baum_welch_context(
            self.cmodel, trainingSequences.cseq)
        self.BWcontext.eps = loglikelihoodCutoff
        self.BWcontext.max_iter = nrSteps


    def baumWelchStep(self, nrSteps, loglikelihoodCutoff):
        """
        Compute one iteration of Baum Welch estimation.

        Use with baumWelchSetup for more control over the training, computing
        diagnostics or doing noise-insertion
        """
        # XXX Implement me
        raise NotImplementedError

    def baumWelchDelete(self):
        """
        Delete the necessary temporary variables for Baum-Welch-reestimation
        """
        self.BWcontext = None

    def asMatrices(self):
        "Return the parameters in matrix form."
        A = []
        B = []
        pi = []
        for i in range(self.cmodel.N):
            A.append([0.0] * self.N)
            B.append([0.0] * 2)
            state = self.cmodel.getState(i)
            pi.append(state.pi)

            B[i][0] = state.getMean(0)
            B[i][1] = state.getStdDev(0)

            for j in range(state.out_states):
                state_index = ghmmwrapper.int_array_getitem(state.out_id, j)
                A[i][state_index] = ghmmwrapper.double_matrix_getitem(state.out_a,0,j)

        return [A,B,pi]
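

# Illustrative sketch, not part of the GHMM API. forward() above returns scaled
# forward variables and a scaling vector computed by the C library. The helper
# below is a textbook scaled forward recursion for univariate Gaussian
# emissions in pure Python. It is not GHMM's C implementation, and its scaling
# convention (scale[t] = sum of the unnormalized alpha_t) is an assumption.
# The function name and the parameter layout are hypothetical; 'sigma' is
# taken to be a standard deviation.
import math

def _example_scaled_forward(obs, A, pi, mu, sigma):
    """Return (alpha, scale, logp) for one observation sequence."""
    n = len(pi)

    def pdf(x, m, s):
        # density of a normal distribution with mean m and std. deviation s
        return math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))

    alpha, scale, logp = [], [], 0.0
    prev = None
    for x in obs:
        if prev is None:
            # initialization: alpha_0(i) = pi_i * b_i(o_0)
            row = [pi[i] * pdf(x, mu[i], sigma[i]) for i in range(n)]
        else:
            # induction: alpha_t(i) = (sum_j alpha_{t-1}(j) * a_ji) * b_i(o_t)
            row = [sum(prev[j] * A[j][i] for j in range(n)) * pdf(x, mu[i], sigma[i])
                   for i in range(n)]
        c = sum(row)
        if c == 0.0:
            raise ValueError("sequence cannot be generated by the model")
        row = [a / c for a in row]   # normalize so that each row sums to one
        alpha.append(row)
        scale.append(c)
        logp += math.log(c)          # log P[obs | model] = sum_t log(scale[t])
        prev = row
    return alpha, scale, logp

# Usage: alpha has one row per emission and one column per state, matching the
# (T x N) layout that forward() converts to Python lists.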


# XXX - this class will be taken over by ContinuousMixtureHMM
class GaussianMixtureHMM(GaussianEmissionHMM):
    """ HMMs with mixtures of Gaussians as emissions.

    Optional features:
    - fixing mixture components in training

    """

    def getEmission(self, i, comp):
        """
        @returns (mu, sigma^2, weight) of component 'comp' in state 'i'
        """
        i = self.state(i)
        state  = self.cmodel.getState(i)
        mu     = state.getMean(comp)
        sigma  = state.getStdDev(comp)
        weight = state.getWeight(comp)
        return (mu, sigma, weight)

    def setEmission(self, i, comp, values):
        """ Set the emission distribution parameters for a single component in a single state.

        @param i index of a state
        @param comp index of a mixture component
        @param values tuple of mu, sigma, weight
        """
        mu, sigma, weight = values
        i = self.state(i)

        state = self.cmodel.getState(i)
        state.setMean(comp, float(mu))  # GHMM C is german: mue instead of mu
        state.setStdDev(comp, float(sigma))
        state.setWeight(comp, float(weight))

    def getMixtureFix(self,state):
        state = self.state(state)
        s = self.cmodel.getState(state)
        mixfix = []
        for i in range(s.M):
            emission = s.getEmission(i)
            mixfix.append(emission.fixed)
        return mixfix

    def setMixtureFix(self, state ,flags):
        state = self.state(state)
        s = self.cmodel.getState(state)
        for i in range(s.M):
            emission = s.getEmission(i)
            emission.fixed = flags[i]

    def __str__(self):
        hmm = self.cmodel
        strout = [str(self.__class__.__name__)]
        if self.cmodel.name:
            strout.append( " " + str(self.cmodel.name))
        strout.append(  "(N="+ str(hmm.N)+')\n')

        f = lambda x: "%.2f" % (x,)  # float rounding function

        if hmm.N <= 4:
            iter_list = list(range(self.N))
        else:
            iter_list = [0,1,'X',hmm.N-2,hmm.N-1]

        for k in iter_list:
            if k == 'X':
                strout.append('\n  ...\n\n')
                continue

            state = hmm.getState(k)
            strout.append("  state "+ str(k) + " (")
            strout.append( "initial=" + f(state.pi) )
            if self.cmodel.cos > 1:
                strout.append(', cos='+ str(self.cmodel.cos))
            strout.append(')\n')

            weight = ""
            mue = ""
            u =  ""

            for outp in range(state.M):
                emission = state.getEmission(outp)
                weight += str(ghmmwrapper.double_array_getitem(state.c,outp))+", "
                mue += str(emission.mean.val)+", "
                u += str(emission.variance.val)+", "

            strout.append( "    Emissions (")
            strout.append("weights=" + str(weight) + ", ")
            strout.append("mu=" + str(mue) + ", ")
            strout.append("sigma=" + str(u) + ")\n")


            strout.append( "    Transitions: ")
            if self.cmodel.cos > 1:
                strout.append("\n")

            for c in range(self.cmodel.cos):
                if self.cmodel.cos > 1:
                    strout.append('      class: ' + str(c)+ ':'  )
                for i in range( state.out_states):
                    strout.append('->' + str(state.getOutState(i)) + ' (' + str(state.getOutProb(i, c))+')' )
                    if i < state.out_states-1:
                        strout.append( ', ')

                strout.append('\n')

        return ''.join(strout)


    def verboseStr(self):
        "defines string representation"
        hmm = self.cmodel

        strout = ["\nOverview of HMM:"]
        strout.append("\nNumber of states: "+ str(hmm.N))
        strout.append("\nNumber of mixture components: "+ str(hmm.M))

        for k in range(hmm.N):
            state = hmm.getState(k)
            strout.append("\n\nState number "+ str(k) +":")
            if(state.desc is not None):
                strout.append("\nState Name: " + state.desc)
            strout.append("\nInitial probability: " + str(state.pi))
            strout.append("\n"+ str(state.M) + " mixture component(s):\n")

            weight = ""
            mue = ""
            u =  ""

            for outp in range(state.M):
                emission = state.getEmission(outp)
                weight += str(ghmmwrapper.double_array_getitem(state.c,outp))+", "
                mue += str(emission.mean.val)+", "
                u += str(emission.variance.val)+", "

            strout.append("  pdf component weights : " + str(weight) + "\n")
            strout.append("  mean vector: " + str(mue) + "\n")
            strout.append("  variance vector: " + str(u) + "\n")

            for c in range(self.cmodel.cos):
                strout.append("\n  Class : " + str(c)                )
                strout.append("\n    Outgoing transitions:")
                for i in range( state.out_states):
                    strout.append("\n      transition to state " + str(state.getOutState(i)) + " with probability = " + str(state.getOutProb(i, c)))
                strout.append("\n    Ingoing transitions:")
                for i in range(state.in_states):
                    strout.append("\n      transition from state " + str(state.getInState(i)) +" with probability = "+ str(state.getInProb(i, c)))

            strout.append("\nint fix:" + str(state.fix) + "\n")
        return ''.join(strout)


    def asMatrices(self):
        "Return the parameters in matrix form."
        A = []
        B = []
        pi = []
        for i in range(self.cmodel.N):
            A.append([0.0] * self.N)
            B.append([])
            state = self.cmodel.getState(i)
            pi.append(state.pi)

            mulist = []
            siglist = []
            for j in range(state.M):
                emission = state.getEmission(j)
                mulist.append(emission.mean.val)
                siglist.append(emission.variance.val)

            B[i].append(mulist)
            B[i].append(siglist)
            B[i].append(ghmmwrapper.double_array2list(state.c, state.M))

            for j in range(state.out_states):
                state_index = ghmmwrapper.int_array_getitem(state.out_id, j)
                A[i][state_index] = ghmmwrapper.double_matrix_getitem(state.out_a,0,j)

        return [A,B,pi]
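

# Illustrative sketch, not part of the GHMM API. asMatrices() above returns,
# for every state i, B[i] = [means, variances, weights] of its mixture
# components. The helper below evaluates the corresponding Gaussian mixture
# density at a point x. The function name is hypothetical, and reading the
# second list as variances (not standard deviations) is an assumption based
# on the 'emission.variance.val' field used above.
import math

def _example_mixture_density(x, means, variances, weights):
    """Return sum_k weights[k] * Normal(x; means[k], variances[k])."""
    total = 0.0
    for m, v, w in zip(means, variances, weights):
        total += w * math.exp(-(x - m) ** 2 / (2.0 * v)) / math.sqrt(2.0 * math.pi * v)
    return total

# Usage: given A, B, pi = model.asMatrices(), the emission density of state i
# at x could be evaluated as _example_mixture_density(x, *B[i]).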


class ContinuousMixtureHMM(GaussianMixtureHMM):
    """ HMMs with mixtures of any univariate (one dimensional) Continuous
    Distributions as emissions.

    Optional features:
    - fixing mixture components in training
    """

    def getEmission(self, i, comp):
        """
        @returns the parameters of component 'comp' in state 'i'
        - (type, mu,  sigma^2, weight)        - for a gaussian component
        - (type, mu,  sigma^2, min,   weight) - for a right tail gaussian
        - (type, mu,  sigma^2, max,   weight) - for a left  tail gaussian
        - (type, max, min,     weight)        - for a uniform
        """
        i = self.state(i)
        state  = self.cmodel.getState(i)
        emission = state.getEmission(comp)
        if (emission.type == ghmmwrapper.normal or
            emission.type == ghmmwrapper.normal_approx):
            return (emission.type, emission.mean.val, emission.variance.val, state.getWeight(comp))
        elif emission.type == ghmmwrapper.normal_right:
            return (emission.type, emission.mean.val, emission.variance.val,
                    emission.min, state.getWeight(comp))
        elif emission.type == ghmmwrapper.normal_left:
            return (emission.type, emission.mean.val, emission.variance.val,
                    emission.max, state.getWeight(comp))
        elif emission.type == ghmmwrapper.uniform:
            return (emission.type, emission.max, emission.min, state.getWeight(comp))

    def setEmission(self, i, comp, distType, values):
        """ Set the emission distribution parameters for a mixture component
        of a single state.

        @param i index of a state
        @param comp index of a mixture component
        @param distType type of the distribution
        @param values tuple (mu, sigma, a , weight) and is interpreted depending
        on distType
        - mu     - mean for normal, normal_approx, normal_right, normal_left
        - mu     - max for uniform
        - sigma  - standard deviation for normal, normal_approx, normal_right,
          normal_left
        - sigma  - min for uniform
        - a      - cut-off for normal_right and normal_left
        - weight - always component weight
        """

        mu, sigma, a, weight = values
        i = self.state(i)

        state = self.cmodel.getState(i)
        state.setWeight(comp, float(weight))
        emission = state.getEmission(comp)
        emission.type = distType
        if (emission.type == ghmmwrapper.normal or
            emission.type == ghmmwrapper.normal_approx or
            emission.type == ghmmwrapper.normal_right or
            emission.type == ghmmwrapper.normal_left):
            emission.mean.val = mu
            emission.variance.val = sigma
            if emission.type == ghmmwrapper.normal_right:
                emission.min = a
            if emission.type == ghmmwrapper.normal_left:
                emission.max = a
        elif emission.type == ghmmwrapper.uniform:
            emission.min = sigma
            emission.max = mu
        else:
            raise TypeError("Unknown distribution type: " + str(distType))

    def __str__(self):
        """ defines string representation """
        return "<ContinuousMixtureHMM with "+str(self.cmodel.N)+" states>"

    def verboseStr(self):
        """ Human readable model description """
        hmm = self.cmodel

        strout = ["\nOverview of HMM:"]
        strout.append("\nNumber of states: "+ str(hmm.N))
        strout.append("\nMaximum number of output distributions per state: "+ str(hmm.M))

        for k in range(hmm.N):
            state = hmm.getState(k)
            strout.append("\n\nState number "+ str(k) +":")
            if(state.desc is not None):
                strout.append("\nState Name: " + state.desc)
            strout.append("\n  Initial probability: " + str(state.pi))
            strout.append("\n  "+ str(state.M) + " density function(s):")

            for outp in range(state.M):
                comp_str = "\n    " + str(state.getWeight(outp)) + " * "
                emission = state.getEmission(outp)
                # named dist_type to avoid shadowing the builtin 'type'
                dist_type = emission.type
                if dist_type == ghmmwrapper.normal:
                    comp_str += "normal(mean = " + str(emission.mean.val)
                    comp_str += ", variance = " + str(emission.variance.val) + ")"
                elif dist_type == ghmmwrapper.normal_right:
                    comp_str += "normal right tail(mean = " + str(emission.mean.val)
                    comp_str += ", variance = " + str(emission.variance.val)
                    comp_str += ", minimum = " + str(emission.min) + ")"
                elif dist_type == ghmmwrapper.normal_left:
                    comp_str += "normal left tail(mean = " + str(emission.mean.val)
                    comp_str += ", variance = " + str(emission.variance.val)
                    comp_str += ", maximum = " + str(emission.max) + ")"
                elif dist_type == ghmmwrapper.uniform:
                    comp_str += "uniform(minimum = " + str(emission.min)
                    comp_str += ", maximum = " + str(emission.max) + ")"

                strout.append(comp_str)

            for c in range(self.cmodel.cos):
                strout.append("\n  Class : " + str(c))
                strout.append("\n    Outgoing transitions:")
                for i in range( state.out_states):
                    strout.append("\n      transition to state " + str(state.getOutState(i)) +
                                  " with probability = " + str(state.getOutProb(i, c)))

                strout.append("\n    Ingoing transitions:")
                for i in range(state.in_states):
                    strout.append("\n      transition from state " + str(state.getInState(i)) +
                                  " with probability = "+ str(state.getInProb(i, c)))

            strout.append("\n  int fix:" + str(state.fix))

        strout.append("\n")
        return ''.join(strout)

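    def asMatrices(self):
        """Return the parameters in matrix form as [A, B, pi, d].

        A holds the transition probabilities, B the emission parameters and
        component weights of each state, pi the initial probabilities and
        d the density types of the mixture components of each state.
        """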
        # XXX inherit transitions ????

        A = []
        B = []
        pi = []
        d = []
        for i in range(self.cmodel.N):
            A.append([0.0] * self.N)
            B.append([])
            state = self.cmodel.getState(i)
            pi.append(state.pi)
            denList = []

            parlist = []
            for j in range(state.M):
                emission = state.getEmission(j)
                denList.append(emission.type)
                if emission.type == ghmmwrapper.normal:
                    parlist.append([emission.mean.val, emission.variance.val,
                                    0, state.getWeight(j)])
                elif emission.type == ghmmwrapper.normal_right:
                    parlist.append([emission.mean.val, emission.variance.val,
                                    emission.min, state.getWeight(j)])
                elif emission.type == ghmmwrapper.normal_left:
                    parlist.append([emission.mean.val, emission.variance.val,
                                    emission.max, state.getWeight(j)])
                elif emission.type == ghmmwrapper.uniform:
                    parlist.append([emission.max, emission.min, 0, state.getWeight(j)])
                else:
                    raise TypeError("Unsupported distribution " + str(emission.type))

            for j in range(4):
                B[i].append([l[j] for l in parlist])

            d.append(denList)

            for j in range(state.out_states):
                state_index = state.getOutState(j)
                A[i][state_index] = ghmmwrapper.double_matrix_getitem(state.out_a,0,j)

        return [A,B,pi,d]
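

# Illustrative sketch (not part of the GHMM API): asMatrices above collects
# one [param1, param2, param3, weight] row per mixture component and then
# regroups those rows column-wise to build B[i]. The helper name
# _sketch_columnwise is hypothetical; it only mirrors that regrouping step
# on plain Python lists.
def _sketch_columnwise(parlist, width=4):
    """Regroup per-component rows into `width` per-parameter columns."""
    return [[row[j] for row in parlist] for j in range(width)]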


class MultivariateGaussianMixtureHMM(GaussianEmissionHMM):
    """ HMMs with Multivariate Gaussian distribution as emissions.

    States can have multiple mixture components.
    """

    def __init__(self, emissionDomain, distribution, cmodel):
        HMM.__init__(self, emissionDomain, distribution, cmodel)

        # Baum Welch context, call baumWelchSetup to initialize
        self.BWcontext = ""

    def getEmission(self, i, m):
        """
        @returns mean and covariance matrix of component m in state i
        """
        i = self.state(i)
        state = self.cmodel.getState(i)
        assert 0 <= m < state.M, "Index " + str(m) + " out of bounds."

        emission = state.getEmission(m)
        mu = ghmmwrapper.double_array2list(emission.mean.vec,emission.dimension)
        sigma = ghmmwrapper.double_array2list(emission.variance.mat,emission.dimension*emission.dimension)
        return (mu, sigma)

    def setEmission(self, i, m, values):
        """ Set the emission distribution parameters for mixture component m
        in state i

        @param i index of a state
        @param m index of a mixture component
        @param values tuple of mu, sigma
        """

        mu, sigma = values
        i = self.state(i)

        state = self.cmodel.getState(i)
        assert 0 <= m < state.M, "Index " + str(m) + " out of bounds."

        emission = state.getEmission(m)
        emission.mean.vec = ghmmwrapper.list2double_array(mu)
        emission.variance.mat = ghmmwrapper.list2double_array(sigma)

    def __str__(self):
        hmm = self.cmodel
        strout = ["\nHMM Overview:"]
        strout.append("\nNumber of states: " + str(hmm.N))
        strout.append("\nMaximum number of mixture components: " + str(hmm.M))
        strout.append("\nNumber of dimensions: " + str(hmm.dim))

        for k in range(hmm.N):
            state = hmm.getState(k)
            strout.append("\n\nState number "+ str(k) + ":")
            strout.append("\nInitial probability: " + str(state.pi))
            strout.append("\nNumber of mixture components: " + str(state.M))

            for m in range(state.M):
                strout.append("\n\n  Emission number "+ str(m) + ":")

                weight = ""
                mue = ""
                u =  ""
                uinv = ""
                ucd = ""

                weight += str(ghmmwrapper.double_array_getitem(state.c,m))

                emission = state.getEmission(m)
                mue += str(ghmmwrapper.double_array2list(emission.mean.vec,emission.dimension))
                u += str(ghmmwrapper.double_array2list(emission.variance.mat,emission.dimension*emission.dimension))
                uinv += str(ghmmwrapper.double_array2list(emission.sigmainv,emission.dimension*emission.dimension))
                ucd += str(ghmmwrapper.double_array2list(emission.sigmacd,emission.dimension*emission.dimension))

                strout.append("\n    emission type: " + str(emission.type))
                strout.append("\n    emission weight: " + str(weight))
                strout.append("\n    mean: " + str(mue))
                strout.append("\n    covariance matrix: " + str(u))
                strout.append("\n    inverse of covariance matrix: " + str(uinv))
                strout.append("\n    determinant of covariance matrix: " + str(emission.det))
                strout.append("\n    cholesky decomposition of covariance matrix: " + str(ucd))
                strout.append("\n    fix: " + str(state.fix))

            for c in range(self.cmodel.cos):
                strout.append("\n\n  Class : " + str(c))
                strout.append("\n    Outgoing transitions:")
                for i in range( state.out_states):
                    strout.append("\n      transition to state " + str(state.getOutState(i) ) + " with probability = " + str(state.getOutProb(i, c)))
                strout.append("\n    Ingoing transitions:")
                for i in range(state.in_states):
                    strout.append("\n      transition from state " + str(state.getInState(i)) +" with probability = "+ str(state.getInProb(i, c)))

        return ''.join(strout)

    def asMatrices(self):
        "Return the parameters in matrix form."
        A = []
        B = []
        pi = []
        for i in range(self.cmodel.N):
            A.append([0.0] * self.N)
            emissionparams = []
            state = self.cmodel.getState(i)
            pi.append(state.pi)
            for m in range(state.M):
                emission = state.getEmission(m)
                mu = ghmmwrapper.double_array2list(emission.mean.vec,emission.dimension)
                sigma = ghmmwrapper.double_array2list(emission.variance.mat,(emission.dimension*emission.dimension))
                emissionparams.append(mu)
                emissionparams.append(sigma)

            if state.M > 1:
                weights = ghmmwrapper.double_array2list(state.c,state.M)
                emissionparams.append(weights)

            B.append(emissionparams)

            for j in range(state.out_states):
                state_index = ghmmwrapper.int_array_getitem(state.out_id, j)
                A[i][state_index] = ghmmwrapper.double_matrix_getitem(state.out_a,0,j)

        return [A,B,pi]
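

# Illustrative sketch (not part of the GHMM API): getEmission and asMatrices
# above return each covariance matrix as a flat list of
# dimension * dimension values. The hypothetical helper below regroups such
# a flat list into rows, assuming row-major order; for a symmetric
# covariance matrix the choice of row- or column-major does not change the
# result.
def _sketch_as_square(flat, dimension):
    """Split a flat list of dimension**2 values into `dimension` rows."""
    assert len(flat) == dimension * dimension, "unexpected number of values"
    return [flat[r * dimension:(r + 1) * dimension] for r in range(dimension)]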


def HMMDiscriminativeTraining(HMMList, SeqList, nrSteps = 50, gradient = 0):
    """ Trains several HMMs to increase the probabilistic distance
    when the HMMs are used as a classifier.

    @param HMMList List of labeled HMMs
    @param SeqList List of labeled sequences, one for each HMM
    @param nrSteps maximal number of iterations
    @param gradient @todo document me

    @note this method does an initial expectation maximization training
    """

    if len(HMMList) != len(SeqList):
        raise TypeError('Input lists are not equally long')

    if not isinstance(HMMList[0], StateLabelHMM):
        raise TypeError('Input is not a StateLabelHMM')

    if not SeqList[0].hasStateLabels:
        raise TypeError('Input sequence has no labels')

    inplen = len(HMMList)
    if gradient not in [0, 1]:
        raise UnknownInputType("TrainingType " + str(gradient) + " not supported.")

    for i in range(inplen):
        if HMMList[i].emissionDomain.CDataType == "double":
            raise TypeError('discriminative training is at the moment only implemented on discrete HMMs')
        #initial training with Baum-Welch
        HMMList[i].baumWelch(SeqList[i], 3, 1e-9)

    HMMArray = ghmmwrapper.dmodel_ptr_array_alloc(inplen)
    SeqArray = ghmmwrapper.dseq_ptr_array_alloc(inplen)

    for i in range(inplen):
        ghmmwrapper.dmodel_ptr_array_setitem(HMMArray, i, HMMList[i].cmodel)
        ghmmwrapper.dseq_ptr_array_setitem(SeqArray, i, SeqList[i].cseq)

    ghmmwrapper.ghmm_dmodel_label_discriminative(HMMArray, SeqArray, inplen, nrSteps, gradient)

    for i in range(inplen):
        HMMList[i].cmodel = ghmmwrapper.dmodel_ptr_array_getitem(HMMArray, i)
        SeqList[i].cseq   = ghmmwrapper.dseq_ptr_array_getitem(SeqArray, i)

    ghmmwrapper.free(HMMArray)
    ghmmwrapper.free(SeqArray)

    return HMMDiscriminativePerformance(HMMList, SeqList)



def HMMDiscriminativePerformance(HMMList, SeqList):
    """ Computes the discriminative performance of the HMMs in HMMList
    under the sequences in SeqList
    """

    if len(HMMList) != len(SeqList):
        raise TypeError('Input lists are not equally long')

    if not isinstance(HMMList[0], StateLabelHMM):
        raise TypeError('Input is not a StateLabelHMM')

    if not SeqList[0].hasStateLabels:
        raise TypeError('Input sequence has no labels')

    inplen = len(HMMList)

    single = [0.0] * inplen

    HMMArray = ghmmwrapper.dmodel_ptr_array_alloc(inplen)
    SeqArray = ghmmwrapper.dseq_ptr_array_alloc(inplen)

    for i in range(inplen):
        ghmmwrapper.dmodel_ptr_array_setitem(HMMArray, i, HMMList[i].cmodel)
        ghmmwrapper.dseq_ptr_array_setitem(SeqArray, i, SeqList[i].cseq)

    retval = ghmmwrapper.ghmm_dmodel_label_discrim_perf(HMMArray, SeqArray, inplen)

    ghmmwrapper.free(HMMArray)
    ghmmwrapper.free(SeqArray)

    return retval

########## Here comes all the Pair HMM stuff ##########
class DiscretePairDistribution(DiscreteDistribution):
    """
    A DiscreteDistribution over TWO Alphabets: The discrete distribution
    is parameterized by the vector of probabilities.
    To get the index of the vector that corresponds to a pair of characters
    use the getPairIndex method.

    """

    def __init__(self, alphabetX, alphabetY, offsetX, offsetY):
        """
        construct a new DiscretePairDistribution
        @param alphabetX Alphabet object for sequence X
        @param alphabetY Alphabet object for sequence Y
        @param offsetX number of characters the alphabet of sequence X
        consumes at a time
        @param offsetY number of characters the alphabet of sequence Y
        consumes at a time
        """
        self.alphabetX = alphabetX
        self.alphabetY = alphabetY
        self.offsetX = offsetX
        self.offsetY = offsetY
        self.prob_vector = None
        self.pairIndexFunction = ghmmwrapper.ghmm_dpmodel_pair

    def getPairIndex(self, charX, charY):
        """
        get the index of a pair of two characters in the probability vector
        (if you use the int representation both values must be ints)
        @param charX character chain or int representation
        @param charY character chain or int representation
        @return the index of the pair in the probability vector
        """
        if (not (type(charX) == type(1) and type(charY) == type(1))):
            if (charX == "-"):
                intX = 0 # check this!
            else:
                intX = self.alphabetX.internal(charX)
            if (charY == "-"):
                intY = 0 # check this!
            else:
                intY = self.alphabetY.internal(charY)
        else:
            intX = charX
            intY = charY
        return self.pairIndexFunction(intX, intY,
                                      len(self.alphabetX),
                                      self.offsetX, self.offsetY)

    def setPairProbability(self, charX, charY, probability):
        """
        set the probability of the pair charX and charY to probability
        @param charX character chain or int representation
        @param charY character chain or int representation
        @param probability probability (0<=float<=1)
        """
        self.prob_vector[self.getPairIndex(charX, charY)] = probability

    def getEmptyProbabilityVector(self):
        """
        get an empty probability vector for this distribution (filled with 0.0)
        @return list of floats
        """
        length = self.pairIndexFunction(len(self.alphabetX) - 1,
                                        len(self.alphabetY) - 1,
                                        len(self.alphabetX),
                                        self.offsetX, self.offsetY) + 1
        return [0.0 for i in range(length)]

    def getCounts(self, sequenceX, sequenceY):
        """
        extract the pair counts for aligned sequences sequenceX and sequenceY
        @param sequenceX string for sequence X
        @param sequenceY string for sequence Y
        @return a list of counts
        """
        counts = self.getEmptyProbabilityVector()
        if (self.offsetX != 0 and self.offsetY != 0):
            assert len(sequenceX) // self.offsetX == len(sequenceY) // self.offsetY
            for i in range(len(sequenceX) // self.offsetX):
                charX = sequenceX[i*self.offsetX:(i+1)*self.offsetX]
                charY = sequenceY[i*self.offsetY:(i+1)*self.offsetY]
                counts[self.getPairIndex(charX, charY)] += 1
            return counts
        elif (self.offsetX == 0 and self.offsetY == 0):
            log.error( "Silent states (offsetX==0 and offsetY==0) not supported")
            return counts
        elif (self.offsetX == 0):
            charX = "-"
            for i in range(len(sequenceY) // self.offsetY):
                charY = sequenceY[i*self.offsetY:(i+1)*self.offsetY]
                counts[self.getPairIndex(charX, charY)] += 1
            return counts
        elif (self.offsetY == 0):
            charY = "-"
            for i in range(len(sequenceX) // self.offsetX):
                charX = sequenceX[i*self.offsetX:(i+1)*self.offsetX]
                counts[self.getPairIndex(charX, charY)] += 1
            return counts
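

# Illustrative sketch (not part of the GHMM API): getCounts above walks two
# aligned sequences in steps of offsetX and offsetY characters and counts
# each pair of chunks. The hypothetical helper below shows the same chunking
# on plain strings; it counts the pairs in a dictionary instead of the index
# vector addressed through getPairIndex.
def _sketch_pair_counts(sequenceX, sequenceY, offsetX=1, offsetY=1):
    """Count aligned chunk pairs; both offsets are assumed to be > 0."""
    steps = len(sequenceX) // offsetX
    assert steps == len(sequenceY) // offsetY, "sequences are not aligned"
    counts = {}
    for i in range(steps):
        charX = sequenceX[i * offsetX:(i + 1) * offsetX]
        charY = sequenceY[i * offsetY:(i + 1) * offsetY]
        counts[(charX, charY)] = counts.get((charX, charY), 0) + 1
    return counts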


# XXX Change to MultivariateEmissionSequence
class ComplexEmissionSequence(object):
    """
    A ComplexEmissionSequence is a sequence of multiple emissions per
    time-point. Emissions can be from distinct EmissionDomains. In particular,
    integer and floating point emissions are allowed. Access to emissions is
    given by the index, separately for discrete and continuous EmissionDomains.

    Example: XXX

    ComplexEmissionSequence also links to the underlying C-structure.

    Note: ComplexEmissionSequence has to be considered immutable for the moment.
    There are no means to manipulate the sequence positions yet.
    """

    def __init__(self, emissionDomains, sequenceInputs, labelDomain = None, labelInput = None):
        """
        @param emissionDomains a list of EmissionDomain objects corresponding
        to the list of sequenceInputs
        @param sequenceInputs a list of sequences of the same length (e.g.
        nucleotides and double values) that will be encoded
        by the corresponding EmissionDomain
        @bug @param labelDomain unused
        @bug @param labelInput unused
        """
        assert len(emissionDomains) == len(sequenceInputs)
        assert len(sequenceInputs) > 0
        self.length = len(sequenceInputs[0])
        for sequenceInput in sequenceInputs:
            assert self.length == len(sequenceInput)

        self.discreteDomains = []
        self.discreteInputs = []
        self.continuousDomains = []
        self.continuousInputs = []
        for i in range(len(emissionDomains)):
            if emissionDomains[i].CDataType == "int":
                self.discreteDomains.append(emissionDomains[i])
                self.discreteInputs.append(sequenceInputs[i])
            if emissionDomains[i].CDataType == "double":
                self.continuousDomains.append(emissionDomains[i])
                self.continuousInputs.append(sequenceInputs[i])

        self.cseq = ghmmwrapper.ghmm_dpseq(self.length,
                                           len(self.discreteDomains),
                                           len(self.continuousDomains))

        for i in range(len(self.discreteInputs)):
            internalInput = []
            offset = self.discreteDomains[i].getExternalCharacterLength()
            if offset is None:
                internalInput = self.discreteDomains[i].internalSequence(self.discreteInputs[i])
            else:
                if (type(self.discreteInputs[i]) == type([])):
                    # we have string sequences with equally large characters so
                    # we can join the list representation
                    self.discreteInputs[i] = ("").join(self.discreteInputs[i])

                for j in range(offset - 1):
                    internalInput.append(-1) # put -1 at the start
                for j in range(offset-1, len(self.discreteInputs[i])):
                    internalInput.append(self.discreteDomains[i].internal(
                        self.discreteInputs[i][j-(offset-1):j+1]))
            pointerDiscrete = self.cseq.get_discrete(i)
            for j in range(len(self)):
                ghmmwrapper.int_array_setitem(pointerDiscrete, j, internalInput[j])
            # self.cseq.set_discrete(i, seq)

        for i in range(len(self.continuousInputs)):
            #seq = [float(x) for x in self.continuousInputs[i]]
            #seq = ghmmwrapper.list2double_array(seq)
            pointerContinuous = self.cseq.get_continuous(i)
            for j in range(len(self)):
                ghmmwrapper.double_array_setitem(pointerContinuous, j, self.continuousInputs[i][j])
            # self.cseq.set_continuous(i, seq)

    def __del__(self):
        """
        Deallocation of C sequence struct.
        """
        del self.cseq
        self.cseq = None

    def __len__(self):
        """
        @return the length of the sequence.
        """
        return self.length

    def getInternalDiscreteSequence(self, index):
        """
        access the underlying C structure and return the internal
        representation of the discrete sequence number 'index'
        @param index number of the discrete sequence
        @return a python list of ints
        """
        int_pointer = self.cseq.get_discrete(index)
        internal = ghmmwrapper.int_array2list(int_pointer, len(self))
        int_pointer = None
        return internal

    def getInternalContinuousSequence(self, index):
        """
        access the underlying C structure and return the internal
        representation of the continuous sequence number 'index'
        @param index number of the continuous sequence
        @return a python list of floats
        """
        d_pointer = self.cseq.get_continuous(index)
        internal = ghmmwrapper.double_array2list(d_pointer, len(self))
        return internal

    def getDiscreteSequence(self, index):
        """
        get the 'index'th discrete sequence as it has been given at the input
        @param index number of the discrete sequence
        @return a python sequence
        """
        return self.discreteInputs[index]

    def __getitem__(self, key):
        """
        get a slice of the complex emission sequence
        @param key either an int (of limited use) or a slice object
        @return a new ComplexEmissionSequence containing a slice of the
        original
        """
        domains = []
        for domain in self.discreteDomains:
            domains.append(domain)
        for domain in self.continuousDomains:
            domains.append(domain)
        slicedInput = []
        for input in self.discreteInputs:
            slicedInput.append(input[key])
        for input in self.continuousInputs:
            slicedInput.append(input[key])
        return ComplexEmissionSequence(domains, slicedInput)

    def __str__(self):
        """
        short string representation; use verboseStr to get the sequence in
        all its encodings
        @return string representation
        """
        return "<ComplexEmissionSequence>"

    def verboseStr(self):
        """
        string representation. Access the underlying C-structure and return
        the sequence in all its encodings (can be quite long)
        @return string representation
        """
        s = ("ComplexEmissionSequence (len=%i, discrete=%i, continuous=%i)\n"%
             (self.cseq.length, len(self.discreteDomains),
              len(self.continuousDomains)))
        for i in range(len(self.discreteDomains)):
            s += ("").join([str(self.discreteDomains[i].external(x))
                            for x in self.getInternalDiscreteSequence(i)])
            s += "\n"
        for i in range(len(self.continuousDomains)):
            s += (",").join([str(self.continuousDomains[i].external(x))
                            for x in self.getInternalContinuousSequence(i)])
            s += "\n"
        return s
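

# Illustrative sketch (not part of the GHMM API): for alphabets whose
# external characters span `offset` symbols, ComplexEmissionSequence.__init__
# above pads the first offset - 1 positions with -1 and then encodes the
# window of `offset` symbols ending at each later position. The hypothetical
# helper below reproduces that loop; `encode` stands in for the alphabet's
# internal() method.
def _sketch_window_encode(sequence, offset, encode):
    """Encode overlapping windows of `offset` symbols, padding with -1."""
    internal = [-1] * (offset - 1)  # no complete window ends here yet
    for j in range(offset - 1, len(sequence)):
        internal.append(encode(sequence[j - (offset - 1):j + 1]))
    return internal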

class PairHMM(HMM):
    """
    Pair HMMs with discrete emissions over multiple alphabets.
    Optional features: continuous values for transition classes
    """
    def __init__(self, emissionDomains, distribution, cmodel):
        """
        create a new PairHMM object (this should only be done using the
        factory: e.g model = PairHMMOpenXML(modelfile) )
        @param emissionDomains list of EmissionDomain objects
        @param distribution (not used) inherited from HMM
        @param cmodel a swig pointer on the underlying C structure
        """
        HMM.__init__(self, emissionDomains[0], distribution, cmodel)
        self.emissionDomains = emissionDomains
        self.alphabetSizes = []
        for domain in self.emissionDomains:
            if (isinstance(domain, Alphabet)):
                self.alphabetSizes.append(len(domain))

        self.maxSize = 10000
        self.model_type = self.cmodel.model_type  # model type
        self.background = None

        self.states = {}

    def __str__(self):
        """
        short string representation giving the number of states; use
        verboseStr to show the contents of the C structure ghmm_dpmodel
        @return string representation
        """
        return "<PairHMM with " + str(self.cmodel.N) + " states>"

    def verboseStr(self):
        """
        string representation (more for debugging) shows the contents of the C
        structure ghmm_dpmodel
        @return string representation
        """
        hmm = self.cmodel
        strout = ["\nGHMM Model\n"]
        strout.append("Name: " + str(self.cmodel.name))
        strout.append("\nModelflags: "+ self.printtypes(self.cmodel.model_type))
        strout.append("\nNumber of states: "+ str(hmm.N))
        strout.append("\nSize of Alphabet: "+ str(hmm.M))
        for k in range(hmm.N):
            state = hmm.getState(k)
            strout.append("\n\nState number "+ str(k) +":")
            if(state.desc is not None):
                strout.append("\nState Name: " + state.desc)
            strout.append("\nInitial probability: " + str(state.pi))
            strout.append("\nOutput probabilities: ")
            #strout.append(str(ghmmwrapper.double_array_getitem(state.b,outp)))
            strout.append("\n")

            strout.append("\nOutgoing transitions:")
            for i in range( state.out_states):
                strout.append("\ntransition to state " + str(state.out_id[i]) + " with probability " + str(ghmmwrapper.double_array_getitem(state.out_a,i)))
            strout.append("\nIngoing transitions:")
            for i in range(state.in_states):
                strout.append("\ntransition from state " + str(state.in_id[i]) + " with probability " + str(ghmmwrapper.double_array_getitem(state.in_a,i)))
                strout.append("\nint fix:" + str(state.fix) + "\n")

        if hmm.model_type & kSilentStates:
            strout.append("\nSilent states: \n")
            for k in range(hmm.N):
                strout.append(str(hmm.silent[k]) + ", ")
            strout.append("\n")

        return ''.join(strout)


    def viterbi(self, complexEmissionSequenceX, complexEmissionSequenceY):
        """
        run the naive implementation of the Viterbi algorithm and
        return the viterbi path and the log probability of the path
        @param complexEmissionSequenceX sequence X encoded as ComplexEmissionSequence
        @param complexEmissionSequenceY sequence Y encoded as ComplexEmissionSequence
        @return (path, log_p)
        """
        # get a pointer on a double and an int to get return values by reference
        log_p_ptr = ghmmwrapper.double_array_alloc(1)
        length_ptr = ghmmwrapper.int_array_alloc(1)
        # call log_p and length will be passed by reference
        cpath = self.cmodel.viterbi(complexEmissionSequenceX.cseq,
                                    complexEmissionSequenceY.cseq,
                                    log_p_ptr, length_ptr)
        # get the values from the pointers
        log_p = ghmmwrapper.double_array_getitem(log_p_ptr, 0)
        length = length_ptr[0]
        path = [cpath[x] for x in range(length)]
        # free the memory
        ghmmwrapper.free(log_p_ptr)
        ghmmwrapper.free(length_ptr)
        ghmmwrapper.free(cpath)
        return (path, log_p)

    def viterbiPropagate(self, complexEmissionSequenceX, complexEmissionSequenceY, startX=None, startY=None, stopX=None, stopY=None, startState=None, startLogp=None, stopState=None, stopLogp=None):
        """
        run the linear space implementation of the Viterbi algorithm and
        return the viterbi path and the log probability of the path
        @param complexEmissionSequenceX sequence X encoded as ComplexEmissionSequence
        @param complexEmissionSequenceY sequence Y encoded as ComplexEmissionSequence
        Optional parameters to run the algorithm only on a segment:
        @param startX start index in X
        @param startY start index in Y
        @param stopX stop index in X
        @param stopY stop index in Y
        @param startState start the path in this state
        @param stopState path ends in this state
        @param startLogp initialize the start state with this log probability
        @param stopLogp if known this is the logp of the partial path
        @return (path, log_p)
        """
        # get a pointer on a double and an int to get return values by reference
        log_p_ptr = ghmmwrapper.double_array_alloc(1)
        length_ptr = ghmmwrapper.int_array_alloc(1)
        # call log_p and length will be passed by reference
        if any(arg is None for arg in (startX, startY, stopX, stopY,
                                       startState, stopState, startLogp)):
            cpath = self.cmodel.viterbi_propagate(
                complexEmissionSequenceX.cseq,
                complexEmissionSequenceY.cseq,
                log_p_ptr, length_ptr,
                self.maxSize)
        else:
            if stopLogp is None:
                stopLogp = 0
            cpath = self.cmodel.viterbi_propagate_segment(
                complexEmissionSequenceX.cseq,
                complexEmissionSequenceY.cseq,
                log_p_ptr, length_ptr, self.maxSize,
                startX, startY, stopX, stopY, startState, stopState,
                startLogp, stopLogp)

        # get the values from the pointers
        log_p = ghmmwrapper.double_array_getitem(log_p_ptr, 0)
        length = length_ptr[0]
        path = [cpath[x] for x in range(length)]
        # free the memory
        ghmmwrapper.free(log_p_ptr)
        ghmmwrapper.free(length_ptr)
        ghmmwrapper.free(cpath)
        return (path, log_p)

    def logP(self, complexEmissionSequenceX, complexEmissionSequenceY, path):
        """
        compute the log probability of two sequences X and Y and a path
        @param complexEmissionSequenceX sequence X encoded as
        ComplexEmissionSequence
        @param complexEmissionSequenceY sequence Y encoded as
        ComplexEmissionSequence
        @param path the state path
        @return log probability
        """
        cpath = ghmmwrapper.list2int_array(path)
        logP = self.cmodel.viterbi_logP(complexEmissionSequenceX.cseq,
                                        complexEmissionSequenceY.cseq,
                                        cpath, len(path))
        ghmmwrapper.free(cpath)
        return logP

    def addEmissionDomains(self, emissionDomains):
        """
        add additional EmissionDomains that are not specified in the XML file.
        This is used to add information for the transition classes.
        @param emissionDomains a list of EmissionDomain objects
        """
        self.emissionDomains.extend(emissionDomains)
        discreteDomains = []
        continuousDomains = []
        for i in range(len(emissionDomains)):
            if emissionDomains[i].CDataType == "int":
                discreteDomains.append(emissionDomains[i])
                self.alphabetSizes.append(len(emissionDomains[i]))
            if emissionDomains[i].CDataType == "double":
                continuousDomains.append(emissionDomains[i])

        self.cmodel.number_of_alphabets += len(discreteDomains)
        self.cmodel.size_of_alphabet = ghmmwrapper.list2int_array(self.alphabetSizes)

        self.cmodel.number_of_d_seqs += len(continuousDomains)

    def checkEmissions(self, eps=0.0000000000001):
        """
        checks the sum of emission probabilities in all states
        @param eps precision (a state passes if its sum differs from 1 by at most eps)
        @return 1 if the emission of all states sum to one, 0 otherwise
        """
        allok = 1
        for state in self.states:
            emissionSum = sum(state.emissions)
            if (abs(1 - emissionSum) > eps):
                log.debug(("Emissions in state %s (%s) do not sum to 1 (%s)" % (state.id, state.label, emissionSum)))
                allok = 0
        return allok

    def checkTransitions(self, eps=0.0000000000001):
        """
        checks the sum of outgoing transition probabilities for all states
        @param eps precision (a state passes if its sum differs from 1 by at most eps)
        @return 1 if the transitions of all states sum to one, 0 otherwise
        """
        allok = 1
        # from buildMatrices in xmlutil: map each state's XML index to its
        # C-style position (ordering from XML)
        orders = {s.index: k for k, s in enumerate(self.states)}

        for state in self.states:
            for tclass in range(state.kclasses):
                outSum = 0.0
                c_state = self.cmodel.getState(orders[state.index])
                for out in range(c_state.out_states):
                    outSum += ghmmwrapper.double_matrix_getitem(c_state.out_a,
                                                        out, tclass)

                if (abs(1 - outSum) > eps):
                    log.debug("Outgoing transitions in state %s (%s) do not sum to 1 (%s) for class %s" % (state.id, state.label, outSum, tclass))
                    allok = 0
        return allok
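
# Illustrative sketch (not part of the GHMM API): the tolerance test that
# checkEmissions and checkTransitions above apply to each probability vector.
# The helper name _sums_to_one is hypothetical and nothing in this module
# calls it. An exact comparison with 1.0 would reject vectors whose sum is
# off by floating-point rounding only, hence the absolute tolerance eps.
def _sums_to_one(probabilities, eps=1e-13):
    """Return True if the values sum to 1 within the absolute tolerance eps."""
    return abs(1 - sum(probabilities)) <= eps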

class PairHMMOpenFactory(HMMOpenFactory):
    """
    factory to create PairHMM objects from XML files
    """
    def __call__(self, fileName_file_or_dom, modelIndex = None):
        """
        Calling the factory loads a model from a file name, a file object
        or an XML Document object and initializes the model on the C side
        (libghmm).
        @param fileName_file_or_dom a file name, a file object or an XML
        Document object to load the model from
        @param modelIndex not used (inherited from HMMOpenFactory)
        @return PairHMM object
        """
        import xml.dom.minidom
        from ghmm_gato import xmlutil

        if not (isinstance(fileName_file_or_dom, io.StringIO) or
                isinstance(fileName_file_or_dom, xml.dom.minidom.Document)):
            if not os.path.exists(fileName_file_or_dom):
                raise IOError('File ' + str(fileName_file_or_dom) + ' not found.')

        hmm_dom = xmlutil.HMM(fileName_file_or_dom)
        if hmm_dom.modelType != "pairHMM":
            raise InvalidModelParameters("Model type specified in the XML file (%s) is not pairHMM" % hmm_dom.modelType)
        # obviously it's a pair HMM
        [alphabets, A, B, pi, state_orders] = hmm_dom.buildMatrices()
        if len(A) != len(A[0]):
            raise InvalidModelParameters("A is not square.")
        if len(pi) != len(A):
            raise InvalidModelParameters("Length of pi does not match length of A.")
        if len(A) != len(B):
            raise InvalidModelParameters("Different number of entries in A and B.")

        cmodel = ghmmwrapper.ghmm_dp_init()
        cmodel.N = len(A)
        cmodel.M = -1 # no longer used (was len(emissionDomain))

        # tie groups are deactivated by default
        cmodel.tied_to = None

        # assign model identifier (if specified)
        if hmm_dom.name is not None:
            cmodel.name = hmm_dom.name
        else:
            cmodel.name = 'Unused'

        alphabets = hmm_dom.getAlphabets()
        cmodel.number_of_alphabets = len(alphabets)
        sizes = [len(alphabets[k]) for k in alphabets]
        cmodel.size_of_alphabet = ghmmwrapper.list2int_array(sizes)

        # set number of d_seqs to zero. If you want to use them you have to
        # set them manually
        cmodel.number_of_d_seqs = 0

        # c array of states allocated
        cstates = ghmmwrapper.dpstate_array_alloc(cmodel.N)
        # python list of states from xml
        pystates = list(hmm_dom.state.values())

        silent_flag = 0
        silent_states = []

        maxOffsetX = 0
        maxOffsetY = 0

        transitionClassFlag = 0
        maxTransitionIndexDiscrete = len(list(alphabets.keys()))
        maxTransitionIndexContinuous = 0

        # from buildMatrices in xmlutil: map each state's XML index to its
        # C-style position (ordering from XML)
        orders = {s.index: k for k, s in enumerate(pystates)}

        #initialize states
        for i in range(cmodel.N):
            cstate = ghmmwrapper.dpstate_array_getitem(cstates, i)
            pystate = pystates[i]
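            size = len(pystate.itsHMM.hmmAlphabets[pystate.alphabet_id])
            # a state that advances in both sequences emits symbol pairs,
            # so its emission table has size**2 entries
            if (pystate.offsetX != 0 and pystate.offsetY != 0):
                size = size**2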
            if (len(B[i]) != size):
                raise InvalidModelParameters("in state %s len(emissions) = %i size should be %i" % (pystate.id, len(B[i]), size))
            cstate.b = ghmmwrapper.list2double_array(B[i])
            cstate.pi = pi[i]
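            if (pi[i] != 0):
                cstate.log_pi = math.log(pi[i])
            else:
                # log(0) is undefined; a valid log probability is never
                # positive, so 1 presumably serves as a marker for pi == 0
                cstate.log_pi = 1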

            cstate.alphabet = pystate.alphabet_id
            cstate.offset_x = pystate.offsetX
            cstate.offset_y = pystate.offsetY
            cstate.kclasses = pystate.kclasses

            if (pystate.offsetX > maxOffsetX):
                maxOffsetX = pystate.offsetX
            if (pystate.offsetY > maxOffsetY):
                maxOffsetY = pystate.offsetY

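            # a state whose emission probabilities are all zero is silent
            if (sum(B[i]) == 0):
                silent_states.append(1)
                # added to cmodel.model_type after the loop to mark the
                # model as containing silent states
                silent_flag = 4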
            else:
                silent_states.append(0)

                # transition probabilities (this block is nested in the else
                # branch, so it runs for non-silent states only)
                # cstate.out_states, cstate.out_id, out_a = ghmmhelper.extract_out(A[i])
                v = pystate.index
                #print "C state index: %i pystate index: %i order: %i" % (i, v, orders[v])
                outprobs = []
                for j in range(len(hmm_dom.G.OutNeighbors(v))):
                    outprobs.append([0.0] * pystate.kclasses)
                myoutid = []
                j = 0
                for outid in hmm_dom.G.OutNeighbors(v):
                    myorder = orders[outid]
                    myoutid.append(myorder)
                    for tclass in range(pystate.kclasses):
                        outprobs[j][tclass] = hmm_dom.G.edgeWeights[tclass][(v,outid)]
                    j += 1
                cstate.out_states = len(myoutid)
                cstate.out_id = ghmmwrapper.list2int_array(myoutid)
                (cstate.out_a, col_len) = ghmmhelper.list2double_matrix(outprobs)
                #set "in" probabilities
                # A_col_i = map( lambda x: x[i], A)
                # Numarray use A[,:i]
                # cstate.in_states, cstate.in_id, cstate.in_a = ghmmhelper.extract_out(A_col_i)
                inprobs = []
                for inid in hmm_dom.G.InNeighbors(v):
                    myorder = orders[inid]
                    # for every class in source
                    inprobs.append([0.0] * pystates[myorder].kclasses)
                myinid = []
                j = 0
                for inid in hmm_dom.G.InNeighbors(v):
                    myorder = orders[inid]
                    myinid.append(myorder)
                    # for every transition class of the source state add a prob
                    for tclass in range(pystates[myorder].kclasses):
                        inprobs[j][tclass] = hmm_dom.G.edgeWeights[tclass][(inid,v)]
                    j += 1

                j = 0
                #for inid in myinid:
                #    print "Transitions (%i, %i)" % (inid ,i)
                #    print inprobs[j]
                #    j += 1

                cstate.in_states = len(myinid)
                cstate.in_id = ghmmwrapper.list2int_array(myinid)
                (cstate.in_a, col_len) = ghmmhelper.list2double_matrix(inprobs)
                # fix flag: 1 keeps the probabilities fixed during
                # reestimation, 0 lets them be reestimated
                cstate.fix = 0

                # set the class determination function
                cstate.class_change = ghmmwrapper.ghmm_dp_init_class_change()
                if (pystate.transitionFunction != -1):
                    transitionClassFlag = 1
                    tf = hmm_dom.transitionFunctions[pystate.transitionFunction]
                    # For the moment, do not use the offsets: they add the
                    # risk of segmentation faults at the ends of the loops
                    # or necessitate index checks at every query, which is
                    # undesirable because the transition functions are
                    # used in every iteration. Instead use shifted input
                    # values!
                    if (tf.type == "lt_sum"):
                        ghmmwrapper.set_to_lt_sum(
                            cstate.class_change,
                            int(tf.paramDict["seq_index"]),
                            float(tf.paramDict["threshold"]),
                            0, # int(tf.paramDict["offset_x"]),
                            0) # int(tf.paramDict["offset_y"]))
                        maxTransitionIndexContinuous = max(
                            int(tf.paramDict["seq_index"]),
                            maxTransitionIndexContinuous)
                    elif (tf.type == "gt_sum"):
                        ghmmwrapper.set_to_gt_sum(
                            cstate.class_change,
                            int(tf.paramDict["seq_index"]),
                            float(tf.paramDict["threshold"]),
                            0, # int(tf.paramDict["offset_x"]),
                            0) # int(tf.paramDict["offset_y"]))
                        maxTransitionIndexContinuous = max(
                            int(tf.paramDict["seq_index"]),
                            maxTransitionIndexContinuous)
                    elif (tf.type == "boolean_and"):
                        ghmmwrapper.set_to_boolean_and(
                            cstate.class_change,
                            int(tf.paramDict["seq_index"]),
                            0, # int(tf.paramDict["offset_x"]),
                            0) # int(tf.paramDict["offset_y"]))
                        maxTransitionIndexDiscrete = max(
                            int(tf.paramDict["seq_index"]),
                            maxTransitionIndexDiscrete)
                    elif (tf.type == "boolean_or"):
                        ghmmwrapper.set_to_boolean_or(
                            cstate.class_change,
                            int(tf.paramDict["seq_index"]),
                            0, # int(tf.paramDict["offset_x"]),
                            0) # int(tf.paramDict["offset_y"]))
                        maxTransitionIndexDiscrete = max(
                            int(tf.paramDict["seq_index"]),
                            maxTransitionIndexDiscrete)
                else:
                    ghmmwrapper.ghmm_dp_set_to_default_transition_class(cstate.class_change)

        cmodel.s = cstates

        cmodel.max_offset_x = maxOffsetX
        cmodel.max_offset_y = maxOffsetY

        cmodel.model_type += silent_flag
        cmodel.silent = ghmmwrapper.list2int_array(silent_states)
        distribution = DiscreteDistribution(DNA)
        emissionDomains = [Alphabet(list(hmm_dom.hmmAlphabets[alphabet].name.values())) for alphabet in alphabets]
        model = PairHMM(emissionDomains, distribution, cmodel)
        model.states = pystates
        model.transitionFunctions = hmm_dom.transitionFunctions
        model.usesTransitionClasses = transitionClassFlag
        model.alphabetSizes = sizes
        model.maxTransitionIndexContinuous = maxTransitionIndexContinuous
        model.maxTransitionIndexDiscrete = maxTransitionIndexDiscrete
        return model

PairHMMOpenXML = PairHMMOpenFactory()
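
# Illustrative sketch (not part of the GHMM API): how the factory above lays
# out the outgoing transitions of one state before handing them to the C
# side. The function name and its plain-Python arguments are hypothetical
# stand-ins for hmm_dom.G.OutNeighbors(v), hmm_dom.G.edgeWeights, the orders
# mapping and pystate.kclasses; nothing in this module calls it.
def _example_out_transitions(v, out_neighbors, edge_weights, orders, kclasses):
    """
    Return (out_ids, out_probs) for the state with XML index v.
    out_ids[j] is the C-style index of the j-th successor and
    out_probs[j][tclass] the weight of that edge in transition class tclass.
    """
    # translate XML state indices into C-style positions
    out_ids = [orders[w] for w in out_neighbors]
    # one row per successor, one column per transition class
    out_probs = [[edge_weights[tclass][(v, w)] for tclass in range(kclasses)]
                 for w in out_neighbors]
    return out_ids, out_probs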