Submitted by: a315344690
Date Submitted: 02/06/2013 09:25 PM
COM3110 Text Processing (2012/13) Assignment: Document Retrieval
Task in brief: To complete a basic document retrieval system and evaluate its performance.
Submission: Submit your assignment work electronically via MOLE. Precise instructions for what files to submit are given later in this document. Please check that you can access the relevant MOLE unit (listed as “COM3110∼COM4115∼COM6115”) and let me know if not.
SUBMISSION DEADLINE: 3pm, Wednesday, 21 November, 2012
Penalties: Standard departmental penalties apply for late hand-in and for plagiarism.
Materials Provided
Download the file 3110 Assignment Files.zip from the module homepage. It unzips to give a folder containing a number of code and data files for use in the assignment.

Data files: The materials provided include a file documents.txt, which contains a collection of documents that record publications in the CACM (Communications of the Association for Computing Machinery). Each document is a short record of a CACM paper, including its title, author(s), and abstract, although one or other of these (especially the abstract) may be absent for a given document. The file queries.txt contains a set of IR queries for use against this collection. (These are ‘old-style’ queries, where users might write an entire paragraph describing their interest.) The file cacm gold std.txt is a ‘gold standard’ identifying the documents that have been judged relevant to each query. These three files together constitute a standard test set that has been used for evaluating IR systems (although it is now somewhat dated, not least by being very small by modern standards).

As discussed in class, a standard IR system creates an inverted index over a document collection (such as documents.txt), to allow efficient identification of the documents relevant to a query. Various choices are made in preprocessing documents before indexation (e.g. whether a stoplist is used, whether terms are stemmed, etc.) with various consequences (e.g. for...
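
The inverted index idea mentioned above can be illustrated with a minimal Python sketch. This is not the code supplied with the assignment: the function names (tokenize, build_index, candidates), the toy documents, and the toy stoplist are all invented for illustration, and no assumption is made about the actual format of documents.txt. Stemming is deliberately left out.

```python
import re
from collections import defaultdict


def tokenize(text, stoplist=frozenset()):
    """Lowercase, split on non-alphanumerics, and drop stoplisted words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in stoplist]


def build_index(docs, stoplist=frozenset()):
    """Map each term to {doc_id: number of occurrences in that document}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in tokenize(text, stoplist):
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return dict(index)


def candidates(index, query, stoplist=frozenset()):
    """Return the ids of documents sharing at least one term with the query."""
    found = set()
    for term in tokenize(query, stoplist):
        found.update(index.get(term, {}))
    return found


# Toy documents, invented for illustration (not taken from documents.txt).
toy_docs = {
    1: "Sorting algorithms for the computer",
    2: "The design of a compiler",
    3: "Parallel sorting on a network of computers",
}
# Hypothetical stoplist; a real one would be much longer.
toy_stoplist = frozenset({"the", "of", "a", "for", "on"})

toy_index = build_index(toy_docs, toy_stoplist)
```

With this sketch, candidates(toy_index, "sorting compiler", toy_stoplist) looks up only the postings for the query terms rather than scanning every document. Because no stemming is applied, "computer" and "computers" are indexed as separate terms, which is one example of how a preprocessing choice changes what a query can match.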