Submitted by: Submitted by adtyshkhr
Views: 169
Words: 11569
Pages: 47
Category: Science and Technology
Date Submitted: 03/10/2013 12:50 AM
Distributed Aggregation for Data-Parallel Computing:
Interfaces and Implementations
Yuan Yu
Pradeep Kumar Gunda
Michael Isard
Microsoft Research
1065 La Avenida Ave.
Mountain View, CA 94043
yuanbyu@microsoft.com
Microsoft Research
1065 La Avenida Ave.
Mountain View, CA 94043
pgunda@microsoft.com
Microsoft Research
1065 La Avenida Ave.
Mountain View, CA 94043
misard@microsoft.com
ABSTRACT
1.
INTRODUCTION
Many data-mining computations have as a fundamental subroutine a “GroupBy-Aggregate” operation. This takes a dataset, partitions its records
into groups according to some key, then performs
an aggregation over each resulting group. GroupByAggregate is useful for summarization, e.g. finding
average household income by zip code from a census
dataset, but it is also at the heart of the distributed
implementation of algorithms such as matrix multiplication [22, 27]. The ability to perform GroupByAggregate at scale is therefore increasingly important, both for traditional data-mining tasks and also
for emerging applications such as web-scale machine
learning and graph analysis.
This paper analyzes the programming models that
are supplied for user-defined aggregation by several
state of the art distributed systems, evaluates a variety of optimizations that are suitable for aggregations with differing properties, and investigates the
interaction between the two. In particular, we show
that the choice of programming interface not only affects the ease of programming complex user-defined
aggregations, but can also make a material difference
to the performance of some optimizations.
GroupBy-Aggregate has emerged as a canonical
execution model in the general-purpose distributed
computing literature. Systems like MapReduce [9]
and Hadoop [3] allow programmers to decompose
an arbitrary computation into a sequence of maps
Categories and Subject Descriptors
and reductions, which are written in a full-fledged
D.1.3 [Programming...