uk.ac.shef.dcs.oak.jate.core.feature.indexer
Class GlobalIndexMem

java.lang.Object
  extended by uk.ac.shef.dcs.oak.jate.core.feature.indexer.GlobalIndex
      extended by uk.ac.shef.dcs.oak.jate.core.feature.indexer.GlobalIndexMem

public class GlobalIndexMem
extends GlobalIndex

GlobalIndexMem stores, in memory, the binary relations between candidate terms (words or phrases) and the corpus. These include:
- candidate term canonical forms and their int ids
- candidate term variant forms and their int ids
- the mapping from each candidate term's canonical form to its variant forms
- corpus elements (documents) and their int ids
- candidate terms and the documents that contain them (term id - document ids)
- documents and the candidate terms they contain (document id - term ids)

In other words, a GlobalIndexMem maps a candidate term to its id, a candidate term's canonical form to its variants found in the corpus, a document to its id, a lexical unit id to the ids of the documents that contain the unit, and a document id to the ids of the lexical units that the document contains.
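
The layout described above can be pictured with plain java.util collections. The following is only a sketch under stated assumptions, not the JATE implementation: the class IndexLayoutSketch and its methods are hypothetical, documents are keyed by a String name instead of a Document object, and both directions of the relation are filled in one call for brevity.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for the layout described above; this is not the JATE class.
class IndexLayoutSketch {
    final Map<String, Integer> termIdMap = new HashMap<>();        // canonical form -> term id
    final Map<String, Integer> docIdMap = new HashMap<>();         // document name -> document id
    final Map<Integer, Set<Integer>> term2Docs = new HashMap<>();  // term id -> ids of containing documents
    final Map<Integer, Set<Integer>> doc2Terms = new HashMap<>();  // document id -> ids of contained terms

    // Return the id stored for key, assigning the next free id on first sight.
    private static int idFor(Map<String, Integer> ids, String key) {
        Integer id = ids.get(key);
        if (id == null) {
            id = ids.size();
            ids.put(key, id);
        }
        return id;
    }

    // Record "term found in doc" in both directions (a simplification of this sketch).
    void add(String term, String doc) {
        int t = idFor(termIdMap, term);
        int d = idFor(docIdMap, doc);
        term2Docs.computeIfAbsent(t, k -> new HashSet<>()).add(d);
        doc2Terms.computeIfAbsent(d, k -> new HashSet<>()).add(t);
    }

    public static void main(String[] args) {
        IndexLayoutSketch idx = new IndexLayoutSketch();
        idx.add("neural network", "a.txt");
        idx.add("neural network", "b.txt");
        idx.add("term extraction", "a.txt");
        System.out.println("term -> docs: " + idx.term2Docs);
        System.out.println("doc -> terms: " + idx.doc2Terms);
    }
}
```

Working with int ids rather than strings keeps the relation maps compact, and the two relation maps answer "which documents contain this term" and "which terms occur in this document" without scanning.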


Field Summary
protected  java.util.Map<java.lang.Integer,java.util.Set<java.lang.Integer>> _doc2Terms
           
protected  java.util.HashMap<Document,java.lang.Integer> _docIdMap
           
protected  java.util.Map<java.lang.Integer,java.util.Set<java.lang.Integer>> _term2Docs
           
protected  java.util.Map<java.lang.Integer,java.util.Set<java.lang.Integer>> _term2Variants
           
protected  java.util.HashMap<java.lang.String,java.lang.Integer> _termIdMap
           
protected  java.util.Map<java.lang.Integer,java.lang.Integer> _variant2term
           
protected  java.util.HashMap<java.lang.String,java.lang.Integer> _variantIdMap
           
 
Fields inherited from class uk.ac.shef.dcs.oak.jate.core.feature.indexer.GlobalIndex
_docCounter, _termCounter, _variantCounter
 
Constructor Summary
protected GlobalIndexMem()
           
 
Method Summary
 java.util.Map<java.lang.Integer,java.util.Set<java.lang.Integer>> getDoc2Terms()
           
 java.util.Map<Document,java.lang.Integer> getDocIdMap()
           
 java.util.Set<java.lang.Integer> getDocumentIds()
           
 java.util.Set<Document> getDocuments()
           
 java.util.Map<java.lang.Integer,java.util.Set<java.lang.Integer>> getTerm2Docs()
           
 java.util.Map<java.lang.Integer,java.util.Set<java.lang.Integer>> getTerm2Variants()
           
 java.util.Set<java.lang.Integer> getTermCanonicalIds()
           
 java.util.Map<java.lang.String,java.lang.Integer> getTermIdMap()
           
 java.util.Set<java.lang.String> getTermsCanonical()
           
 java.util.Set<java.lang.Integer> getTermVariantIds()
           
 java.util.Set<java.lang.String> getTermVariants()
           
 java.util.Map<java.lang.Integer,java.lang.Integer> getVariant2Term()
           
 java.util.Map<java.lang.String,java.lang.Integer> getVariantIdMap()
           
protected  int indexDocument(Document d)
          Given a document, return its id.
protected  void indexDocWithTermsCanonical(Document d, java.util.Set<java.lang.String> terms)
          Given a document d which contains a set of terms (canonical form), index the binary relation "document contains term canonical"
protected  void indexDocWithTermsCanonical(int d, java.util.Set<java.lang.Integer> terms)
          Given a document with id d which contains a set of terms (canonical form), index the binary relation "document contains term canonical"
protected  int indexTermCanonical(java.lang.String term)
          Given a candidate term's canonical form, return its id.
protected  void indexTermCanonicalInDoc(int t, int d)
          Given a candidate term's canonical form id t found in document with id d, index the binary relation "t found_in d"
protected  void indexTermCanonicalInDoc(java.lang.String t, Document d)
          Given a candidate term's canonical form t found in document d, index the binary relation "t found_in d"
protected  int indexTermVariant(java.lang.String termV)
          Given a candidate term variant, index it and return its id.
protected  void indexTermWithVariant(java.util.Map<java.lang.String,java.util.Set<java.lang.String>> map)
          Given a map from each term's canonical form to its variant forms, index that mapping and the reverse mapping from each variant to its canonical form
 int retrieveCanonicalOfTermVariant(java.lang.String termVar)
          Given a term variant form, retrieve the id of its canonical form
 java.util.Set<java.lang.Integer> retrieveDocIdsContainingTermCanonical(int id)
           
 java.util.Set<java.lang.Integer> retrieveDocIdsContainingTermCanonical(java.lang.String t)
           
 java.util.Set<Document> retrieveDocsContainingTermCanonical(int t)
           
 java.util.Set<Document> retrieveDocsContainingTermCanonical(java.lang.String t)
           
 int retrieveDocument(Document d)
          Given a document, return its id.
 Document retrieveDocument(int id)
          Given a document id, return the document
 java.lang.String retrieveTermCanonical(int id)
          Given an id, retrieve the candidate term's canonical form
 int retrieveTermCanonical(java.lang.String term)
          Given a candidate term's canonical form, return its id.
 java.util.Set<java.lang.Integer> retrieveTermCanonicalIdsInDoc(Document d)
           
 java.util.Set<java.lang.Integer> retrieveTermCanonicalIdsInDoc(int d)
           
 java.util.Set<java.lang.String> retrieveTermCanonicalInDoc(int d)
           
 java.util.Set<java.lang.String> retrieveTermsCanonicalInDoc(Document d)
           
protected  java.lang.String retrieveTermVariant(int id)
          Given an id of a candidate term variant, retrieve the text
 java.util.Set<java.lang.String> retrieveVariantsOfTermCanonical(java.lang.String term)
          Given a term canonical form, retrieve its variant forms found in the corpus
 int sizeDocHasTerms(Document d)
           
 int sizeDocHasTerms(int d)
           
 int sizeTermInDocs(int t)
           
 int sizeTermInDocs(java.lang.String t)
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

_termIdMap

protected java.util.HashMap<java.lang.String,java.lang.Integer> _termIdMap

_variantIdMap

protected java.util.HashMap<java.lang.String,java.lang.Integer> _variantIdMap

_docIdMap

protected java.util.HashMap<Document,java.lang.Integer> _docIdMap

_term2Docs

protected java.util.Map<java.lang.Integer,java.util.Set<java.lang.Integer>> _term2Docs

_doc2Terms

protected java.util.Map<java.lang.Integer,java.util.Set<java.lang.Integer>> _doc2Terms

_term2Variants

protected java.util.Map<java.lang.Integer,java.util.Set<java.lang.Integer>> _term2Variants

_variant2term

protected java.util.Map<java.lang.Integer,java.lang.Integer> _variant2term
Constructor Detail

GlobalIndexMem

protected GlobalIndexMem()
Method Detail

getTermIdMap

public java.util.Map<java.lang.String,java.lang.Integer> getTermIdMap()

getVariantIdMap

public java.util.Map<java.lang.String,java.lang.Integer> getVariantIdMap()

getDocIdMap

public java.util.Map<Document,java.lang.Integer> getDocIdMap()

getTerm2Docs

public java.util.Map<java.lang.Integer,java.util.Set<java.lang.Integer>> getTerm2Docs()

getDoc2Terms

public java.util.Map<java.lang.Integer,java.util.Set<java.lang.Integer>> getDoc2Terms()

getTerm2Variants

public java.util.Map<java.lang.Integer,java.util.Set<java.lang.Integer>> getTerm2Variants()

getVariant2Term

public java.util.Map<java.lang.Integer,java.lang.Integer> getVariant2Term()

indexTermCanonical

protected int indexTermCanonical(java.lang.String term)
Given a candidate term's canonical form, return its id. If the term has not been indexed, it is indexed with a new id and that id is returned.

Specified by:
indexTermCanonical in class GlobalIndex
Parameters:
term - the candidate term's canonical form
Returns:
the id

retrieveTermCanonical

public int retrieveTermCanonical(java.lang.String term)
Given a candidate term's canonical form, return its id. If the term has not been indexed, -1 is returned.

Specified by:
retrieveTermCanonical in class GlobalIndex
Parameters:
term - the candidate term's canonical form
Returns:
the id, or -1 if the term has not been indexed
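
The difference between indexTermCanonical (assigns an id on first sight) and retrieveTermCanonical (a pure lookup that returns -1 for unknown terms) can be sketched as follows. This is a minimal illustration, not the library's code: the class TermIdSketch, its method names, and the choice to start ids at 0 are all assumptions.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of "get or assign an id" versus "look up only".
class TermIdSketch {
    private final Map<String, Integer> termIds = new HashMap<>();
    private int counter = 0; // plays the role of a term counter; starting at 0 is an assumption

    // Return the id of the term, indexing it with a fresh id if it is new.
    int index(String term) {
        Integer id = termIds.get(term);
        if (id == null) {
            id = counter++;
            termIds.put(term, id);
        }
        return id;
    }

    // Return the id of the term, or -1 if it has never been indexed.
    int retrieve(String term) {
        Integer id = termIds.get(term);
        return id == null ? -1 : id;
    }

    public static void main(String[] args) {
        TermIdSketch s = new TermIdSketch();
        int first = s.index("neural network");
        System.out.println(first == s.index("neural network")); // same term, same id
        System.out.println(s.retrieve("unseen term"));           // lookup only, nothing is added
    }
}
```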

retrieveTermCanonical

public java.lang.String retrieveTermCanonical(int id)
Given an id, retrieve the candidate term's canonical form.

Specified by:
retrieveTermCanonical in class GlobalIndex
Parameters:
id - the id of the candidate term's canonical form
Returns:
the candidate term's canonical form

getTermCanonicalIds

public java.util.Set<java.lang.Integer> getTermCanonicalIds()
Specified by:
getTermCanonicalIds in class GlobalIndex
Returns:
all indexed candidate term canonical form ids

getTermsCanonical

public java.util.Set<java.lang.String> getTermsCanonical()
Specified by:
getTermsCanonical in class GlobalIndex
Returns:
all candidate term canonical forms

indexTermVariant

protected int indexTermVariant(java.lang.String termV)
Given a candidate term variant, index it and return its id.

Specified by:
indexTermVariant in class GlobalIndex
Parameters:
termV - the candidate term variant
Returns:
the id

retrieveTermVariant

protected java.lang.String retrieveTermVariant(int id)
Given an id of a candidate term variant, retrieve the text

Specified by:
retrieveTermVariant in class GlobalIndex
Parameters:
id - the id of the candidate term variant
Returns:
the text of the candidate term variant

getTermVariantIds

public java.util.Set<java.lang.Integer> getTermVariantIds()
Specified by:
getTermVariantIds in class GlobalIndex
Returns:
all indexed candidate term variant ids

getTermVariants

public java.util.Set<java.lang.String> getTermVariants()
Specified by:
getTermVariants in class GlobalIndex
Returns:
all candidate term variant forms

indexDocument

protected int indexDocument(Document d)
Given a document, return its id. If the document has not been indexed, it is indexed and the new id is returned.

Specified by:
indexDocument in class GlobalIndex
Parameters:
d - the document
Returns:
the id

retrieveDocument

public int retrieveDocument(Document d)
Given a document, return its id. If the document has not been indexed, -1 is returned.

Specified by:
retrieveDocument in class GlobalIndex
Parameters:
d - the document
Returns:
the id, or -1 if the document has not been indexed

retrieveDocument

public Document retrieveDocument(int id)
Given a document id, return the document.

Specified by:
retrieveDocument in class GlobalIndex
Parameters:
id - the id of the document
Returns:
the document with the given id

getDocuments

public java.util.Set<Document> getDocuments()
Specified by:
getDocuments in class GlobalIndex
Returns:
all indexed documents

getDocumentIds

public java.util.Set<java.lang.Integer> getDocumentIds()
Specified by:
getDocumentIds in class GlobalIndex
Returns:
all indexed document ids

indexTermWithVariant

protected void indexTermWithVariant(java.util.Map<java.lang.String,java.util.Set<java.lang.String>> map)
Given a map from each term's canonical form to its variant forms, index that mapping and the reverse mapping from each variant to its canonical form.

Specified by:
indexTermWithVariant in class GlobalIndex
Parameters:
map - the map from term canonical forms to their variant forms
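
Indexing the canonical-to-variant mapping together with its reverse can be sketched as below. The sketch is an assumption-laden illustration rather than the JATE code: VariantMapSketch, idFor, and canonicalIdOfVariant are hypothetical names, ids are assigned by map size, and returning -1 for an unknown variant is a choice made here.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: forward map (term id -> variant ids) plus reverse map (variant id -> term id).
class VariantMapSketch {
    final Map<String, Integer> termIds = new HashMap<>();
    final Map<String, Integer> variantIds = new HashMap<>();
    final Map<Integer, Set<Integer>> term2Variants = new HashMap<>();
    final Map<Integer, Integer> variant2Term = new HashMap<>();

    // Return the id stored for key, assigning the next free id on first sight.
    private static int idFor(Map<String, Integer> ids, String key) {
        Integer id = ids.get(key);
        if (id == null) {
            id = ids.size();
            ids.put(key, id);
        }
        return id;
    }

    // Index every [canonical form - variant forms] entry in both directions.
    void indexTermWithVariant(Map<String, Set<String>> map) {
        for (Map.Entry<String, Set<String>> e : map.entrySet()) {
            int t = idFor(termIds, e.getKey());
            Set<Integer> variants = term2Variants.computeIfAbsent(t, k -> new HashSet<>());
            for (String v : e.getValue()) {
                int vid = idFor(variantIds, v);
                variants.add(vid);
                variant2Term.put(vid, t);
            }
        }
    }

    // Id of the canonical form for a variant, or -1 if the variant is unknown (assumed convention).
    int canonicalIdOfVariant(String variant) {
        Integer vid = variantIds.get(variant);
        if (vid == null) return -1;
        Integer t = variant2Term.get(vid);
        return t == null ? -1 : t;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> map = new HashMap<>();
        map.put("network", new HashSet<>(Arrays.asList("network", "networks")));
        VariantMapSketch s = new VariantMapSketch();
        s.indexTermWithVariant(map);
        System.out.println(s.canonicalIdOfVariant("networks") == s.termIds.get("network"));
    }
}
```

Keeping the reverse map means a variant seen in text can be resolved to its canonical term with two hash lookups instead of a scan over all variant sets.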

retrieveVariantsOfTermCanonical

public java.util.Set<java.lang.String> retrieveVariantsOfTermCanonical(java.lang.String term)
Given a term's canonical form, retrieve its variant forms found in the corpus.

Specified by:
retrieveVariantsOfTermCanonical in class GlobalIndex
Parameters:
term - the term's canonical form
Returns:
the variant forms of the term found in the corpus

retrieveCanonicalOfTermVariant

public int retrieveCanonicalOfTermVariant(java.lang.String termVar)
Given a term variant form, retrieve the id of its canonical form.

Specified by:
retrieveCanonicalOfTermVariant in class GlobalIndex
Parameters:
termVar - the term variant form
Returns:
the id of the canonical form

indexTermCanonicalInDoc

protected void indexTermCanonicalInDoc(java.lang.String t,
                                       Document d)
Given a candidate term's canonical form t found in document d, index the binary relation "t found_in d".

Specified by:
indexTermCanonicalInDoc in class GlobalIndex
Parameters:
t - the candidate term's canonical form
d - the document in which t was found

indexTermCanonicalInDoc

protected void indexTermCanonicalInDoc(int t,
                                       int d)
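Given the id t of a candidate term's canonical form found in the document with id d, index the binary relation "t found_in d".

Specified by:
indexTermCanonicalInDoc in class GlobalIndex
Parameters:
t - the id of the candidate term's canonical form
d - the id of the document in which t was found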

retrieveDocIdsContainingTermCanonical

public java.util.Set<java.lang.Integer> retrieveDocIdsContainingTermCanonical(java.lang.String t)
Specified by:
retrieveDocIdsContainingTermCanonical in class GlobalIndex
Parameters:
t - the candidate term's canonical form in question
Returns:
the ids of the documents that contain the candidate term t

retrieveDocIdsContainingTermCanonical

public java.util.Set<java.lang.Integer> retrieveDocIdsContainingTermCanonical(int id)
Specified by:
retrieveDocIdsContainingTermCanonical in class GlobalIndex
Parameters:
id - the id of the candidate term's canonical form in question
Returns:
the ids of the documents that contain the candidate term

retrieveDocsContainingTermCanonical

public java.util.Set<Document> retrieveDocsContainingTermCanonical(java.lang.String t)
Specified by:
retrieveDocsContainingTermCanonical in class GlobalIndex
Parameters:
t - the candidate term's canonical form in question
Returns:
the documents which contain the candidate term t

retrieveDocsContainingTermCanonical

public java.util.Set<Document> retrieveDocsContainingTermCanonical(int t)
Specified by:
retrieveDocsContainingTermCanonical in class GlobalIndex
Parameters:
t - the candidate term's canonical form id in question
Returns:
the documents which contain the candidate term t

sizeTermInDocs

public int sizeTermInDocs(java.lang.String t)
Specified by:
sizeTermInDocs in class GlobalIndex
Parameters:
t - the candidate term's canonical form
Returns:
number of documents that contain the candidate term (in any of its variant forms)

sizeTermInDocs

public int sizeTermInDocs(int t)
Specified by:
sizeTermInDocs in class GlobalIndex
Parameters:
t - the id of candidate term's canonical form
Returns:
number of documents that contain the candidate term (in any of its variant forms)
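
The count returned by sizeTermInDocs is the size of the set of document ids recorded for a term. A minimal sketch of that idea follows; it is not the JATE implementation, DocFrequencySketch is a hypothetical class, and returning 0 for a term with no recorded documents is an assumption of the sketch.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: document frequency as the size of a term's document-id set.
class DocFrequencySketch {
    private final Map<Integer, Set<Integer>> term2Docs = new HashMap<>();

    // Index the relation "term t found_in document d"; repeated pairs are absorbed by the set.
    void indexTermInDoc(int t, int d) {
        term2Docs.computeIfAbsent(t, k -> new HashSet<>()).add(d);
    }

    // Number of distinct documents recorded for term t (0 if none; assumed convention).
    int sizeTermInDocs(int t) {
        Set<Integer> docs = term2Docs.get(t);
        return docs == null ? 0 : docs.size();
    }

    public static void main(String[] args) {
        DocFrequencySketch s = new DocFrequencySketch();
        s.indexTermInDoc(1, 10);
        s.indexTermInDoc(1, 10); // duplicate occurrence in the same document
        s.indexTermInDoc(1, 11);
        System.out.println(s.sizeTermInDocs(1));
    }
}
```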

indexDocWithTermsCanonical

protected void indexDocWithTermsCanonical(Document d,
                                          java.util.Set<java.lang.String> terms)
Given a document d that contains a set of terms (canonical forms), index the binary relation "document contains term canonical".

Specified by:
indexDocWithTermsCanonical in class GlobalIndex
Parameters:
d - the document
terms - canonical forms of candidate terms found in document d

indexDocWithTermsCanonical

protected void indexDocWithTermsCanonical(int d,
                                          java.util.Set<java.lang.Integer> terms)
Given a document with id d that contains a set of terms (canonical forms), index the binary relation "document contains term canonical".

Specified by:
indexDocWithTermsCanonical in class GlobalIndex
Parameters:
d - id of document
terms - ids of the canonical forms of candidate terms found in document d

retrieveTermCanonicalIdsInDoc

public java.util.Set<java.lang.Integer> retrieveTermCanonicalIdsInDoc(Document d)
Specified by:
retrieveTermCanonicalIdsInDoc in class GlobalIndex
Parameters:
d - the document
Returns:
the ids of canonical forms of terms found in the document d

retrieveTermCanonicalIdsInDoc

public java.util.Set<java.lang.Integer> retrieveTermCanonicalIdsInDoc(int d)
Specified by:
retrieveTermCanonicalIdsInDoc in class GlobalIndex
Parameters:
d - the id of the document
Returns:
the ids of canonical forms of terms found in the document d

retrieveTermsCanonicalInDoc

public java.util.Set<java.lang.String> retrieveTermsCanonicalInDoc(Document d)
Specified by:
retrieveTermsCanonicalInDoc in class GlobalIndex
Parameters:
d - the document
Returns:
the canonical forms of the terms found in the document d

retrieveTermCanonicalInDoc

public java.util.Set<java.lang.String> retrieveTermCanonicalInDoc(int d)
Specified by:
retrieveTermCanonicalInDoc in class GlobalIndex
Parameters:
d - the id of the document
Returns:
the canonical forms of the terms found in the document d

sizeDocHasTerms

public int sizeDocHasTerms(Document d)
Specified by:
sizeDocHasTerms in class GlobalIndex
Parameters:
d - the document
Returns:
number of unique candidate terms (canonical form) found in document d

sizeDocHasTerms

public int sizeDocHasTerms(int d)
Specified by:
sizeDocHasTerms in class GlobalIndex
Parameters:
d - the id of the document
Returns:
number of unique candidate terms (canonical form) found in document d
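
The document-side methods (indexDocWithTermsCanonical, retrieveTermCanonicalInDoc, sizeDocHasTerms) can be pictured as operations on a document id to term ids map. The sketch below is hypothetical and is not taken from the JATE source: DocTermsSketch and its method names are invented for illustration, and resolving ids back to strings by scanning the term-to-id map is an assumption about how a String-keyed map would be reversed.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the "document contains term canonical" relation.
class DocTermsSketch {
    private final Map<String, Integer> termIds = new HashMap<>();
    private final Map<Integer, Set<Integer>> doc2Terms = new HashMap<>();

    // Index the canonical forms found in the document with id d.
    void indexDocWithTerms(int d, Set<String> terms) {
        Set<Integer> ids = doc2Terms.computeIfAbsent(d, k -> new HashSet<>());
        for (String term : terms) {
            Integer id = termIds.get(term);
            if (id == null) {
                id = termIds.size();
                termIds.put(term, id);
            }
            ids.add(id);
        }
    }

    // Canonical forms found in document d, resolved by scanning the term-to-id map.
    Set<String> termsInDoc(int d) {
        Set<Integer> ids = doc2Terms.getOrDefault(d, Collections.emptySet());
        Set<String> out = new HashSet<>();
        for (Map.Entry<String, Integer> e : termIds.entrySet()) {
            if (ids.contains(e.getValue())) {
                out.add(e.getKey());
            }
        }
        return out;
    }

    // Number of unique canonical forms recorded for document d.
    int sizeDocHasTerms(int d) {
        return doc2Terms.getOrDefault(d, Collections.emptySet()).size();
    }

    public static void main(String[] args) {
        DocTermsSketch s = new DocTermsSketch();
        Set<String> terms = new HashSet<>();
        terms.add("neural network");
        terms.add("term extraction");
        s.indexDocWithTerms(0, terms);
        System.out.println(s.termsInDoc(0));
        System.out.println(s.sizeDocHasTerms(0));
    }
}
```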