Assignment

Date Submitted: 05/02/2011 09:58 AM

COM 3110/6150 Text Processing (2010/11) Assignment: Document Retrieval

Task in brief: To implement and test a basic document retrieval system.

Submission: Your assignment work is to be submitted electronically using MOLE. Precise instructions for what files to submit are given later in this document.

SUBMISSION DEADLINE: 3pm, Monday, Week 9 (22 November 2010)

Penalties: Standard departmental penalties apply for late hand-in and for plagiarism.

Materials Provided

Download the file TP Assignment Files.zip from the module homepage. It unzips to give a folder containing the following data and code files for use in the assignment:

data files: documents.txt, queries.txt, cacm gold std.txt, stoplist.txt, example results file.txt
code files: Collection.py, eval ir.py

The file documents.txt contains a collection of documents which record publications in the CACM (Communications of the Association for Computing Machinery). Each document is a short record of a CACM paper, including its title, author(s), and abstract, although one or more of these (especially the abstract) may be absent for a given document.

The file queries.txt contains a set of IR queries for use against this collection. (These are 'old-style' queries, where users might write an entire paragraph describing their interest.)

The file cacm gold std.txt is a 'gold standard' identifying the documents that have been judged relevant to each query. Together, these files constitute a standard test set that has been used for evaluating IR systems (although it is now somewhat dated, not least because it is very small by modern standards).

Code files: If you inspect the files documents.txt and queries.txt, you will see that they share a common format, in which each document or query is enclosed within (XML-style) open and close document tags that also specify a numeric identifier for the document/query. The Python class in Collection.py provides convenient access to documents/queries in the manner of a simple iteration. In particular, if...
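As a rough illustration of the tagged format described above, the following sketch iterates over records enclosed in XML-style tags and yields each numeric identifier with its text. It is not the interface of Collection.py, which is not shown here. The function name iter_records, the tag name "document", the attribute syntax id=N, and the sample records are all assumptions made for the example, so check the real files before relying on any of them.

```python
import re
from typing import Iterator, Tuple


def iter_records(text: str, tag: str = "document") -> Iterator[Tuple[int, str]]:
    """Yield (identifier, body) for each <tag id=N> ... </tag> block.

    Hypothetical helper: the tag name and the id=N attribute syntax are
    assumptions about the file format, not taken from the assignment files.
    """
    pattern = re.compile(
        r"<{0}\s+id\s*=\s*(\d+)\s*>(.*?)</{0}\s*>".format(re.escape(tag)),
        re.DOTALL,  # let '.' span line breaks inside a record
    )
    for match in pattern.finditer(text):
        yield int(match.group(1)), match.group(2).strip()


# Invented sample data in the assumed format (not real CACM records).
sample = """
<document id=1>
first placeholder record
</document>
<document id=2>
second placeholder record
</document>
"""

records = list(iter_records(sample))
for doc_id, body in records:
    print(doc_id, body)
```

Writing the reader as a generator mirrors the "simple iteration" style of access mentioned above: callers loop over the records one at a time and do not need to know how they are delimited.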
