3110 Assignment

Date Submitted: 02/06/2013 09:25 PM

COM3110 Text Processing (2012/13) Assignment: Document Retrieval

Task in brief: To complete a basic document retrieval system and evaluate its performance.

Submission: Submit your assignment work electronically via MOLE. Precise instructions for what files to submit are given later in this document. Please check that you can access the relevant MOLE unit (listed as “COM3110∼COM4115∼COM6115”) and let me know if not.

SUBMISSION DEADLINE: 3pm, Wednesday, 21 November, 2012

Penalties: Standard departmental penalties apply for late hand-in and for plagiarism.

Materials Provided

Download the file 3110 Assignment Files.zip from the module homepage. It unzips to give a folder containing a number of code and data files for use in the assignment.

Data files: The materials provided include a file documents.txt, which contains a collection of documents that record publications in the CACM (Communications of the Association for Computing Machinery). Each document is a short record of a CACM paper, including its title, author(s), and abstract, although one or other of these (especially the abstract) may be absent for a given document. The file queries.txt contains a set of IR queries for use against this collection. (These are ‘old-style’ queries, where users might write an entire paragraph describing their interest.) The file cacm gold std.txt is a ‘gold standard’ identifying the documents that have been judged relevant to each query. These three files together constitute a standard test set that has been used for evaluating IR systems (although it is now somewhat dated, not least in being very small by modern standards).

As discussed in class, a standard IR system creates an inverted index over a document collection (such as documents.txt), to allow efficient identification of the documents relevant to a query. Various choices are made in preprocessing documents before indexing (e.g. whether a stoplist is used, whether terms are stemmed, etc.) with various consequences (e.g. for...
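To make the idea of an inverted index concrete, the following is a minimal sketch in Python. It is not the code supplied with the assignment. The three-document collection, the tiny STOPLIST, and the function names tokenize, build_index and candidates are all hypothetical, chosen only for illustration. The format of documents.txt is not assumed here: documents are simply passed in as a dictionary from document id to text.

```python
import re
from collections import defaultdict

# Hypothetical tiny stoplist; a real system would load a much longer one.
STOPLIST = {"a", "an", "the", "of", "in", "for", "and", "to", "is"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs, stoplist=None, stem=None):
    """Build an inverted index mapping term -> {doc_id: term count}.

    docs     -- dict of doc_id -> raw text
    stoplist -- optional set of terms to skip
    stem     -- optional function applied to each term (e.g. a stemmer)
    """
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            if stoplist and term in stoplist:
                continue
            if stem:
                term = stem(term)
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return dict(index)


def candidates(index, query, stoplist=None, stem=None):
    """Return ids of documents sharing at least one term with the query."""
    found = set()
    for term in tokenize(query):
        if stoplist and term in stoplist:
            continue
        if stem:
            term = stem(term)
        found.update(index.get(term, {}))
    return found


# Hypothetical three-document collection, for illustration only.
docs = {
    1: "Parsing of programming languages",
    2: "A language for parallel processing",
    3: "The design of an operating system",
}
index = build_index(docs, stoplist=STOPLIST)
print(sorted(candidates(index, "parallel languages", stoplist=STOPLIST)))
# prints [1, 2]
```

In this sketch the query term "languages" matches document 1 but not document 2, which contains only "language". Passing a stemming function as stem to both build_index and candidates would conflate the two forms, which is one example of how a preprocessing choice changes which documents are retrieved.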