SIS
Symmetric Index Structures
lmu::cis::sis::DocumentIndexingAutomaton< AutomatonType > Class Template Reference

Template class DocumentIndexingAutomaton for the family of index structures of type AutomatonType (i.e. CDAWGs, SCDAWGs, etc.) More...

#include <DocumentIndexingAutomaton.hpp>

Inheritance diagram for lmu::cis::sis::DocumentIndexingAutomaton< AutomatonType >:
Collaboration diagram for lmu::cis::sis::DocumentIndexingAutomaton< AutomatonType >:

Public Member Functions

 DocumentIndexingAutomaton ()
 Constructor.
 DocumentIndexingAutomaton (UINT symbolsize)
 Constructor.
virtual ~DocumentIndexingAutomaton ()
 Desctructor.
virtual void AddDocuments (const std::vector< DocumentName > &documents)
 Add documents to the index structure DocumentIndexingAutomaton.
virtual void Index ()
 The indexing step.
virtual void Close ()
 The closing step.
virtual
DocumentIndexingAutomatonFindResults && 
findall (const std::string &w)
 start a search / find all occurences of a string
virtual void Shrink ()=0
virtual SinkState Add (const std::string &)=0
virtual SinkState Add (const S8 *string)=0
virtual SinkState Add (const VoidSequence *)=0
virtual UINT number_of_states () const =0
virtual UINT number_of_transitions () const =0
virtual CompressedAutomatonautomaton () const =0

Protected Member Functions

virtual
DocumentIndexingAutomatonFindResults && 
find (const PairState state, UINT length)
 protected method needed by findall(), re-adjusts to explicit state / canonical representative
virtual
DocumentIndexingAutomatonFindResults && 
findRec (const UINT s, const UINT length)
 protected method needed by find(), collects concrete positions in documents
virtual UINT suffixLink (const UINT &state) const
 Follow a suffixlink.
virtual void SortTransitions ()=0

Protected Attributes

ManagedAutomatonStage ManagedAutomatonStage_ = EMPTY
DocumentIndexingAutomatonMemberDescription

DocumentIndexingAutomaton's bookkeeping

Generally, the information that is needed to carry out lookups is

  • a delta function to "walk" to the appropriate state (this state represents and equivalence class)
  • the states information is read, which documents it is attached with (_states_documents)
  • from there, the offset (the position in the document) is calculated by:
    • taking into account the document's length (_document_length)
    • taking into account the pattern's length

To this end, all states need to have the document information attached in an indexing step, see Index() for more information.

See below for a specific description of each member

Todo:
Save pointers instead in these places or make more efficient (document names are saved twice).
std::map< DocumentName, State_document_sinkstate
 association document -> sinkstate
std::map< DocumentName,
PositionsTuple
_document_length
 association document -> document's length information (including current, absolute and actual positions in data)
std::vector< DocumentName_document_names
 a list of all documents that have been added to the automaton (template AutomatonType)
std::map< State, std::vector
< DocumentName > > 
_states_documents

Detailed Description

template<typename AutomatonType>
class lmu::cis::sis::DocumentIndexingAutomaton< AutomatonType >

Template class DocumentIndexingAutomaton for the family of index structures of type AutomatonType (i.e. CDAWGs, SCDAWGs, etc.)

Template Parameters:
AutomatonTypeDocumentIndexingAutomaton is a template class capable of adding document-indexing capbailities to the whole family of index structures (i.e. CDAWGs, SCDAWGs, etc.).

One way to achieve giving document-indexing capbalities to these classes would be inheritance for each one of the structures or – what is done in this case – to use generative programming to have these inheritances be created for you.

See also:
Templatization Description of Document Indexing Automata

Definition at line 42 of file DocumentIndexingAutomaton.hpp.


Constructor & Destructor Documentation

template<typename AutomatonType>
lmu::cis::sis::DocumentIndexingAutomaton< AutomatonType >::DocumentIndexingAutomaton ( ) [inline]

Constructor.

This constructor defaults and delegates to DocumentIndexingAutomaton(UINT symbolsize = 1)

Definition at line 98 of file DocumentIndexingAutomaton.hpp.

template<typename AutomatonType>
lmu::cis::sis::DocumentIndexingAutomaton< AutomatonType >::DocumentIndexingAutomaton ( UINT  symbolsize) [inline]

Constructor.

Parameters:
symbolsize

This constructor builds the templated AutomatonType (so, for instance, an SCDAWGAdapter) with DocumentIndexingAutomaton - capabilities

Note:
Since constructor delegation from the new standard C++11 is just not yet available in gcc, a "hack" needs to be made, in that the underlying CompressedAutomatonAdapter needs to be constructed first. This is needed to ensure that – although AutomatonType ultimately will be extending CompressedAutomatonAdapter –, the symbolsize is properly set in CompressedAutomatonAdapter.

Definition at line 112 of file DocumentIndexingAutomaton.hpp.

template<typename AutomatonType>
virtual lmu::cis::sis::DocumentIndexingAutomaton< AutomatonType >::~DocumentIndexingAutomaton ( ) [inline, virtual]

Desctructor.

Definition at line 121 of file DocumentIndexingAutomaton.hpp.


Member Function Documentation

virtual SinkState lmu::cis::sis::CompressedAutomatonAdapterInterface::Add ( const std::string &  ) [pure virtual, inherited]
virtual SinkState lmu::cis::sis::CompressedAutomatonAdapterInterface::Add ( const S8 string) [pure virtual, inherited]
virtual SinkState lmu::cis::sis::CompressedAutomatonAdapterInterface::Add ( const VoidSequence ) [pure virtual, inherited]
template<typename T >
void lmu::cis::sis::DocumentIndexingAutomaton< T >::AddDocuments ( const std::vector< DocumentName > &  documents) [virtual]

Add documents to the index structure DocumentIndexingAutomaton.

Precondition:
ManagedAutomatonStage_ < CLOSED
Postcondition:
ManagedAutomatonStage_ = UNSORTED

When adding documents, two important pieces of information are saved:

  • the document's sink state (each document is guaranteed its very own sinkstate in Ukkonen's algorithm)
    • (this information is needed in Index())
  • the document's length is recorded
    • (this information is needed for Blumer&Blumer's findall())

Implements lmu::cis::sis::DocumentIndexingAutomatonInterface.

Definition at line 19 of file DocumentIndexingAutomaton.cpp.

Referenced by main().

virtual CompressedAutomaton* lmu::cis::sis::CompressedAutomatonAdapterInterface::automaton ( ) const [inline, pure virtual, inherited]
template<typename T >
void lmu::cis::sis::DocumentIndexingAutomaton< T >::Close ( ) [virtual]

The closing step.

Precondition:
Any stage
Postcondition:
CLOSED stage

The closing step is a non-recoverable step, see DocumentIndexingAutomaton: Stages

Note:
This method will merely dispatch to the concerete automaton's Close() method, but will also set the DocumentIndexingAutomaton to CLOSED stage to make sure it is set.
See also:
DocumentIndexingAutomaton: Stages

Implements lmu::cis::sis::DocumentIndexingAutomatonInterface.

Definition at line 64 of file DocumentIndexingAutomaton.cpp.

Referenced by main().

template<typename T >
DocumentIndexingAutomatonFindResults && lmu::cis::sis::DocumentIndexingAutomaton< T >::find ( const PairState  state,
UINT  length 
) [protected, virtual]

protected method needed by findall(), re-adjusts to explicit state / canonical representative

When searching for all occurences of a pattern p, the chain goes like this: findall() calls find() calls findRec().

In this chain, findall() will pass a state to find(), regardless whether it is an implicit or an explicit state.

find(), as the second in the row, will re-position to the next explicit state (which is the canonical representative of findall()s search pattern) and adjust the current pattern's length accordingly in order to find the correct position in the document(s).

Lastly, find() will dispatch to findRec() to collect all concrete positions.

Returns:
As far as the return value is concerned, find() does a mere pass-through of findRec()'s return value to findall(). The return value is of type DocumentIndexingAutomatonFindResults&&, a C++11 rvalue reference.
Precondition:
As for the stage the automaton needs to be in, the caller find() has already taken care of, plus, this method is protected, therefore not callable from the outside anyway.

Implements lmu::cis::sis::DocumentIndexingAutomatonInterface.

Definition at line 161 of file DocumentIndexingAutomaton.cpp.

template<typename T >
DocumentIndexingAutomatonFindResults && lmu::cis::sis::DocumentIndexingAutomaton< T >::findall ( const std::string &  w) [virtual]

start a search / find all occurences of a string

Precondition:
Lazy method (accepts any of the following stages): ManagedAutomatonStage_ > EMPTY > SORTED > INDEXED
Postcondition:
ManagedAutomatonStage_ = INDEXED

ManagedAutomatonStage_ must be = INDEXED, if not the automaton will be sorted and indexed first (if needed)

See also:
For more in-depth information on concepts please refer to the following descriptions of general principles:

Implements lmu::cis::sis::DocumentIndexingAutomatonInterface.

Definition at line 120 of file DocumentIndexingAutomaton.cpp.

Referenced by main().

template<typename T >
DocumentIndexingAutomatonFindResults && lmu::cis::sis::DocumentIndexingAutomaton< T >::findRec ( const UINT  s,
const UINT  length 
) [protected, virtual]

protected method needed by find(), collects concrete positions in documents

In the chain of findall() -> find() -> findRec(), findRec()'s part is to collect the concrete positions of pattern p in the documents.

See also:
find() for a desciption of stages and return value.

Implements lmu::cis::sis::DocumentIndexingAutomatonInterface.

Definition at line 183 of file DocumentIndexingAutomaton.cpp.

template<typename T >
void lmu::cis::sis::DocumentIndexingAutomaton< T >::Index ( ) [virtual]

The indexing step.

Precondition:
Preconditions are:
  • ManagedAutomatonStage_ > EMPTY
  • (actually ManagedAutomatonStage_ >= SORTED, see "attention")
  • ManagedAutomatonStage_ < CLOSED
  • _document_sinkstate is available
Postcondition:
Postconditions are:
  • ManagedAutomatonStage_ == INDEXED
  • _states_documents is filled

The indexing step (following Blumer&Blumer) is crucial for all follwing findall() steps. Basically, it attaches the document's names to the states whose equivalence class can be found in those files (one or more times).

Note:
The time complexity for this step, again, is linear.
Attention:
For Index() to work, the automaton needs to be in SORTED stage. Should this not be the case, then, for convenience reasons, the automaton will be sorted for you. Otherwise, if the automaton already is in SORTED stage, this step will not be taken.
See also:
DocumentIndexingAutomaton: Stages

Definition at line 73 of file DocumentIndexingAutomaton.cpp.

Referenced by main().

virtual UINT lmu::cis::sis::CompressedAutomatonAdapterInterface::number_of_states ( ) const [pure virtual, inherited]
virtual UINT lmu::cis::sis::CompressedAutomatonAdapterInterface::number_of_transitions ( ) const [pure virtual, inherited]
virtual void lmu::cis::sis::CompressedAutomatonAdapterInterface::Shrink ( ) [pure virtual, inherited]
virtual void lmu::cis::sis::CompressedAutomatonAdapterInterface::SortTransitions ( ) [protected, pure virtual, inherited]
template<typename T >
UINT lmu::cis::sis::DocumentIndexingAutomaton< T >::suffixLink ( const UINT state) const [protected, virtual]

Follow a suffixlink.

Suffixlinks exist in the automaton's respective buildhelp structure, i.e. SCDAWGBuildHelp for AutomatonType SCDAWG, CDAWGBuildHelp for AutomatonType CompressedAutomaton / inenagaCDAWG.

suffixlink()s are needed by Index() in order to attach document information to the states recursively from the sinkstates.

Precondition:
Since the buildhelp structures are released in CLOSED stage, the automaton must be in ManagedAutomatonStage_ < CLOSED
Note:
This implementation is based on Blumer & Blumers algorithm and runs in linear time.
See also:
For more in-depth information see

Implements lmu::cis::sis::CompressedAutomatonAdapterInterface.

Definition at line 223 of file DocumentIndexingAutomaton.cpp.


Field Documentation

template<typename AutomatonType>
std::map<DocumentName, PositionsTuple> lmu::cis::sis::DocumentIndexingAutomaton< AutomatonType >::_document_length [protected]

association document -> document's length information (including current, absolute and actual positions in data)

Definition at line 85 of file DocumentIndexingAutomaton.hpp.

template<typename AutomatonType>
std::vector<DocumentName> lmu::cis::sis::DocumentIndexingAutomaton< AutomatonType >::_document_names [protected]

a list of all documents that have been added to the automaton (template AutomatonType)

Definition at line 86 of file DocumentIndexingAutomaton.hpp.

template<typename AutomatonType>
std::map<DocumentName, State> lmu::cis::sis::DocumentIndexingAutomaton< AutomatonType >::_document_sinkstate [protected]

association document -> sinkstate

Definition at line 84 of file DocumentIndexingAutomaton.hpp.

template<typename AutomatonType>
std::map<State, std::vector<DocumentName> > lmu::cis::sis::DocumentIndexingAutomaton< AutomatonType >::_states_documents [protected]

association state -> list of documents that state is attached to

Definition at line 87 of file DocumentIndexingAutomaton.hpp.

ManagedAutomatonStage lmu::cis::sis::ManagedStageAutomatonInterface::ManagedAutomatonStage_ = EMPTY [protected, inherited]

Definition at line 10 of file ManagedStageAutomatonInterface.hpp.

Referenced by lmu::cis::sis::InenagaCDAWGAdapter::Add(), lmu::cis::sis::CompressedAutomatonAdapter::Add(), lmu::cis::sis::SCDAWGAdapter::Add(), lmu::cis::sis::CompressedAutomatonAdapter::AddDictionary(), lmu::cis::sis::CompressedAutomatonAdapter::AddState(), lmu::cis::sis::CompressedAutomatonAdapter::AddTarjanTable(), lmu::cis::sis::SCDAWGAdapter::AddTarjanTable(), lmu::cis::sis::CompressedAutomatonAdapter::AddTransition(), lmu::cis::sis::InenagaCDAWGAdapter::buildhelp(), lmu::cis::sis::SCDAWGAdapter::buildhelp(), lmu::cis::sis::InenagaCDAWGAdapter::Close(), lmu::cis::sis::SCDAWGAdapter::Close(), lmu::cis::sis::CompressedAutomatonAdapter::Delta(), lmu::cis::sis::SCDAWGAdapter::Delta(), lmu::cis::sis::InenagaCDAWGAdapter::DumpGV(), lmu::cis::sis::CompressedAutomatonAdapter::DumpGV(), lmu::cis::sis::SCDAWGAdapter::DumpGV(), lmu::cis::sis::SCDAWGAdapter::DumpGVRight(), lmu::cis::sis::CompressedAutomatonAdapter::DumpStat(), lmu::cis::sis::InenagaCDAWGAdapter::DumpStat(), lmu::cis::sis::SCDAWGAdapter::DumpStat(), lmu::cis::sis::CompressedAutomatonAdapter::GenerateLanguage(), lmu::cis::sis::SCDAWGAdapter::GenerateLanguage(), lmu::cis::sis::CompressedAutomatonAdapter::GetInitialPairState(), lmu::cis::sis::SCDAWGAdapter::GetInitialPairState(), lmu::cis::sis::CompressedAutomatonAdapter::LeftAutomatonGenerateLanguage(), lmu::cis::sis::CompressedAutomatonAdapter::number_of_states(), lmu::cis::sis::CompressedAutomatonAdapter::number_of_transitions(), lmu::cis::sis::InenagaCDAWGAdapter::Read(), lmu::cis::sis::SCDAWGAdapter::Read(), lmu::cis::sis::CompressedAutomatonAdapter::Shrink(), lmu::cis::sis::SCDAWGAdapter::Shrink(), lmu::cis::sis::InenagaCDAWGAdapter::SortTransitions(), lmu::cis::sis::SCDAWGAdapter::SortTransitions(), lmu::cis::sis::InenagaCDAWGAdapter::suffixLink(), lmu::cis::sis::CompressedAutomatonAdapter::Write(), and lmu::cis::sis::SCDAWGAdapter::Write().


The documentation for this class was generated from the following files: