Advanced Data Structure Project

Title of Project: Succinct data structure in top-k documents retrieval
Objective of the Research
The main aim of this project is to find out how to efficiently retrieve the k documents in which a given pattern occurs most frequently. The problem has been discussed in many papers and solved in various ways; our research surveys the novel algorithms and (succinct) data structures in the recent literature and looks for the one that dominates almost all the space/time tradeoffs.
Background/History of the Study
Before turning to such a succinct data structure, we review a number of fundamental works that our approach builds on. Among the many ideas in classic information retrieval, two are central: the inverted index and term frequency (Angelos, Giannis, Epimeneidis, Euripides, & Evangelos, 2005). The inverted index, also referred to as the postings file, is an index data structure that stores a mapping from content, such as words, to the documents in which it occurs. It is the most widely used data structure in the information retrieval domain and is used on a large scale, for example in search engines. Term frequency is a measure of how often a term occurs in a document.
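To make the two ideas concrete, the following is a minimal Python sketch of an inverted index whose postings store the term frequency of each word per document. The function name build_inverted_index, the sample documents, and the whitespace tokenization are illustrative assumptions, not taken from the cited works.

```python
from collections import Counter, defaultdict

def build_inverted_index(docs):
    """Map each word to {doc_id: term frequency of the word in that document}."""
    index = defaultdict(dict)
    for doc_id, text in enumerate(docs):
        # Assumption: the text splits cleanly into words on whitespace.
        for word, tf in Counter(text.lower().split()).items():
            index[word][doc_id] = tf
    return index

# Hypothetical toy collection used only for illustration.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats",
]
index = build_inverted_index(docs)
postings = index["the"]  # documents containing "the", with term frequencies
```

In this sketch a query for "cat" does not return the third document, although "cats" contains "cat" as a substring, because the index only knows whole words.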

However, these ideas are efficient only under restrictive assumptions: the text must be easily tokenized into words, there must not be too many different words, and queries must be whole words or phrases. These assumptions make document retrieval difficult in many languages. On the other hand, one of the attractive properties of an inverted file is that it is easily compressible while still supporting fast queries. In practice, an inverted file occupies space close to that of a compressed document collection (Niko & Veli, 2007). In further development, full-text indexes such as suffix arrays and suffix trees were found to provide good space/time efficiency as alternatives to inverted files.
Recently, several compressed full-text indexes have been proposed and shown to be effective in practice as well. A generalized suffix tree is a suffix tree for a set of strings. Given a set of strings D = S(1), S(2), …, S(d) of total length n, it is a Patricia tree containing all n suffixes of the strings. It can be built in O(n) time and space and can be used to find all k occurrences of a string P of length m in O(m + k) time. However, it requires O(n log n) bits, which is significantly more than the collection size. Later, Niko V. and Veli M. presented an alternative space-efficient variant of Muthukrishnan’s structure that takes fewer bits while retaining optimal query time (Niko & Veli, 2007). Based on this background, we now move on to our main topic: a succinct data structure in top-k documents retrieval.
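As a rough illustration of full-text indexing over a set of strings, the sketch below uses a sorted list of all suffixes of all documents as a simple stand-in for a generalized suffix tree. It is not a suffix tree and does not achieve the linear-time bounds; the construction is deliberately naive, and the names build_suffix_list and find_occurrences are hypothetical.

```python
from bisect import bisect_left

def build_suffix_list(docs):
    """Sorted list of (suffix, doc_id, start) over all documents (naive construction)."""
    suffixes = []
    for doc_id, text in enumerate(docs):
        for start in range(len(text)):
            suffixes.append((text[start:], doc_id, start))
    suffixes.sort()
    return suffixes

def find_occurrences(suffixes, pattern):
    """Return every (doc_id, start) at which pattern occurs."""
    # Suffixes that begin with the pattern form one contiguous run in sorted
    # order; binary search finds where that run starts.
    lo = bisect_left(suffixes, (pattern,))
    hits = []
    for suffix, doc_id, start in suffixes[lo:]:
        if not suffix.startswith(pattern):
            break
        hits.append((doc_id, start))
    return hits

# Hypothetical toy collection used only for illustration.
docs = ["banana", "bandana", "cabana"]
suffixes = build_suffix_list(docs)
occurrences = find_occurrences(suffixes, "ana")
matching_docs = sorted({doc_id for doc_id, _ in occurrences})
```

Unlike a word-based inverted index, this structure answers queries for arbitrary substrings, at the cost of storing far more than the collection itself.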
Research to the Study
According to the background study above, the suffix tree is used to minimize space consumption. In the suffix tree document model, a document is considered a string consisting of words, not characters. While the suffix tree is being constructed, each suffix of a document is compared with the suffixes already in the tree to find the position at which to insert it. Hon W. K., Shah R., and Wu S. B. introduced the first efficient solution for top-k document retrieval (Hon, Shah, & Wu, 2009). To filter out noise in a large collection, the algorithm adds a minimum term frequency as one of the parameters, so that only documents highly relevant to the pattern P are considered (Hon, Shah, & Wu, 2009).
Furthermore, they formulated the f-mine problem for high relevancy, in which only documents that have more than f occurrences of the pattern need to be retrieved. The notion of relevance here is simply the term frequency. In the later study “Efficient Index for Retrieving Top-k Most Frequent Documents”, Hon W. K., Shah R., and Wu S. B. derived their solution from related problems studied by Muthukrishnan, with the query time and space bounds stated in their paper. The approach is based on a new use of the suffix tree called the induced generalized suffix tree (IGST) (Hon, Shah, & Wu, 2009). The practicality of the proposed index is validated by experimental results.
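The two query types can be stated precisely with a brute-force baseline. The sketch below scans every document for each query, so it only defines the expected answers and is not the IGST-based index of Hon, Shah, and Wu; the function names, the tie-breaking by document id, and the sample documents are illustrative assumptions.

```python
import heapq

def count_occurrences(text, pattern):
    """Count occurrences of a non-empty pattern in text, overlaps included."""
    count, start = 0, text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count

def top_k_documents(docs, pattern, k):
    """Return up to k (doc_id, tf) pairs with the highest term frequency."""
    freqs = [(count_occurrences(text, pattern), doc_id)
             for doc_id, text in enumerate(docs)]
    candidates = [(tf, doc_id) for tf, doc_id in freqs if tf > 0]
    # Assumption: ties in frequency are broken by the smaller document id.
    best = heapq.nsmallest(k, candidates, key=lambda item: (-item[0], item[1]))
    return [(doc_id, tf) for tf, doc_id in best]

def f_mine(docs, pattern, f):
    """Return the ids of documents with more than f occurrences of pattern."""
    return [doc_id for doc_id, text in enumerate(docs)
            if count_occurrences(text, pattern) > f]

# Hypothetical toy collection used only for illustration.
docs = ["abracadabra", "cadabra", "banana", "abab"]
top2 = top_k_documents(docs, "ab", 2)
frequent = f_mine(docs, "ab", 1)
```

Every query here costs time proportional to the whole collection, which is the cost that an index for top-k retrieval is meant to avoid.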
Future Works
With the fundamental works covered, our future analysis of the “Succinct data structure in top-k documents retrieval” will be based mainly on the most recent result by Gonzalo N. and Daniel V. (Gonzalo & Daniel, 2012), a new top-k algorithm that dominates almost all the space/time tradeoffs.
References

Angelos, H., Giannis, V., Epimeneidis, V., Euripides, P. G., & Evangelos, M. (2005). Information Retrieval by Semantic Similarity. Dalhousie University, Faculty of Computer Science. Halifax.
Bieganski, P. (1994). Generalized suffix trees for biological sequence data: applications and implementation. Minnesota University, Dept. of Comput. Sci. Minneapolis.
Gonzalo, N., & Daniel, V. (2012). Space-Efficient Top-k Document Retrieval. Univ. of Chile, Dept. of Computer Science. Valdivia.
Hon, W. K., Shah, R., & Wu, S. B. (2009). Efficient Index for Retrieving Top-k Most Frequent Documents. Springer, Heidelberg.
Niko, V., & Veli, M. (2007). Space-efficient Algorithms for Document Retrieval. The University of Helsinki, Department of Computer Science. Finland.
(1998). Augmenting suffix trees with applications. 6th Annual European Symposium on Algorithms (ESA 1998).
