In Proceedings of IEEE International Conference on Data Mining, ICDM-2003.

Statistical Relational Learning for Document Mining

Alexandrin Popescul
Computer and Information Science, University of Pennsylvania
Philadelphia, PA 19104 USA
popescul@cis.upenn.edu

Steve Lawrence
Google
2400 Bayshore Parkway, Mountain View, CA 94043 USA
lawrence@google.com

Lyle H. Ungar
Computer and Information Science, University of Pennsylvania
Philadelphia, PA 19104 USA
ungar@cis.upenn.edu

David M. Pennock
Overture Services, Inc.
74 N. Pasadena Ave., 3rd floor, Pasadena, CA 91103 USA
david.pennock@overture.com

Abstract

A major obstacle to fully integrated deployment of many data mining algorithms is the assumption that data sits in a single table, even though most real-world databases have complex relational structures. We propose an integrated approach to statistical modeling from relational databases. We structure the search space based on “refinement graphs”, which are widely used in inductive logic programming for learning logic descriptions. The use of statistics allows us to extend the search space to include a richer set of features, including many that are not Boolean. Search and model selection are integrated into a single process, allowing information criteria native to the statistical model, for example logistic regression, to make feature selection decisions in a stepwise manner. We present experimental results for the task of predicting where scientific papers will be published based on relational data taken from CiteSeer. Our approach results in classification accuracies superior to those achieved when using classical “flat” features. The resulting classifier can be used to recommend where to publish articles.
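As a rough illustration of stepwise feature selection driven by an information criterion, the following minimal sketch adds candidate features to a logistic regression one at a time and keeps a feature only if it lowers the Bayesian information criterion (BIC). It is not the implementation used in this paper: the refinement-graph search that generates candidates from relational queries is omitted, the choice of BIC is an assumption, and the feature names (`cites_target_venue`, `even_title_length`), the toy data, and the gradient-ascent fitting routine are all hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_log_likelihood(columns, y, steps=1000, lr=0.5):
    """Fit logistic regression (intercept + given columns) by gradient
    ascent and return the log-likelihood of the fitted model."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    w = [0.0] * len(X[0])
    for _ in range(steps):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = yi - sigmoid(sum(wj * xj for wj, xj in zip(w, xi)))
            for j, xj in enumerate(xi):
                grad[j] += err * xj
        w = [wj + lr * gj / n for wj, gj in zip(w, grad)]
    ll = 0.0
    for xi, yi in zip(X, y):
        p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)))
        p = min(max(p, 1e-12), 1 - 1e-12)  # guard against log(0)
        ll += math.log(p) if yi else math.log(1 - p)
    return ll

def bic(columns, y):
    """BIC = -2 * log-likelihood + (number of parameters) * ln(n)."""
    k = len(columns) + 1  # +1 for the intercept
    return -2.0 * fit_log_likelihood(columns, y) + k * math.log(len(y))

def stepwise_select(candidates, y):
    """Greedy forward selection: add the candidate that lowers BIC the
    most; stop when no remaining candidate improves it."""
    selected = []
    best = bic([], y)
    while True:
        scored = [(bic([candidates[f] for f in selected + [name]], y), name)
                  for name in candidates if name not in selected]
        if not scored:
            break
        score, name = min(scored)
        if score >= best:
            break
        selected.append(name)
        best = score
    return selected

# Hypothetical toy data: y = 1 if a paper appeared in the target venue.
# cites_target_venue is informative; even_title_length is balanced noise.
rows = []
for label, cites, count in [(1, 1, 16), (1, 0, 4), (0, 1, 4), (0, 0, 16)]:
    for i in range(count):
        rows.append((label, cites, i % 2))

y = [r[0] for r in rows]
candidates = {
    "cites_target_venue": [float(r[1]) for r in rows],
    "even_title_length": [float(r[2]) for r in rows],
}
selected = stepwise_select(candidates, y)
print(selected)
```

On this toy data the informative feature should be kept and the noise feature rejected, because the noise feature adds a parameter without improving the fit. In a relational setting, each candidate column would instead be computed by a database query, for example an aggregate over cited documents.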

1. Introduction

Statistical learning techniques play an important role in data mining; however, their standard formulation is almost exclusively limited to a one-table domain representation. Such algorithms are presented with a set of candidate features,...