Data Parallel Computing

Submitted by: Submitted by

Views: 169

Words: 11569

Pages: 47

Category: Science and Technology

Date Submitted: 03/10/2013 12:50 AM

Report This Essay

Distributed Aggregation for Data-Parallel Computing:

Interfaces and Implementations

Yuan Yu

Pradeep Kumar Gunda

Michael Isard

Microsoft Research

1065 La Avenida Ave.

Mountain View, CA 94043

yuanbyu@microsoft.com

Microsoft Research

1065 La Avenida Ave.

Mountain View, CA 94043

pgunda@microsoft.com

Microsoft Research

1065 La Avenida Ave.

Mountain View, CA 94043

misard@microsoft.com

ABSTRACT

1.

INTRODUCTION

Many data-mining computations have as a fundamental subroutine a “GroupBy-Aggregate” operation. This takes a dataset, partitions its records

into groups according to some key, then performs

an aggregation over each resulting group. GroupByAggregate is useful for summarization, e.g. finding

average household income by zip code from a census

dataset, but it is also at the heart of the distributed

implementation of algorithms such as matrix multiplication [22, 27]. The ability to perform GroupByAggregate at scale is therefore increasingly important, both for traditional data-mining tasks and also

for emerging applications such as web-scale machine

learning and graph analysis.

This paper analyzes the programming models that

are supplied for user-defined aggregation by several

state of the art distributed systems, evaluates a variety of optimizations that are suitable for aggregations with differing properties, and investigates the

interaction between the two. In particular, we show

that the choice of programming interface not only affects the ease of programming complex user-defined

aggregations, but can also make a material difference

to the performance of some optimizations.

GroupBy-Aggregate has emerged as a canonical

execution model in the general-purpose distributed

computing literature. Systems like MapReduce [9]

and Hadoop [3] allow programmers to decompose

an arbitrary computation into a sequence of maps

Categories and Subject Descriptors

and reductions, which are written in a full-fledged

D.1.3 [Programming...