This project has retired. For details please refer to its Attic page.
Apache REEF - FAQ

FAQ

  1. Who is REEF for?
  2. Why did it come about?
  3. What is Tang?
  4. What is Wake?
  5. How can I get started?

Who is REEF for?

REEF is for developers of data processing systems on cloud computing platforms that provide fine-grained resource allocations. REEF provides system authors with a centralized (pluggable) control flow that embeds a user-defined system controller called the Job Driver. The interfaces associated with the Job Driver are event driven; events signal resource allocations and failures, various states associated with task executions and communication channels, alarms based on wall-clock or logical time, and so on. REEF also aims to package a variety of data-processing libraries (e.g., high-bandwidth shuffle, relational operators, low-latency group communication, etc.) in a reusable form. Authors of big data systems and toolkits can leverage REEF to immediately begin development of their application specific data flow, while reusing packaged libraries where they make sense.


Why did it come about?

Traditional data-processing systems are built around a single programming model (like SQL or MapReduce) and a runtime (query) engine. These systems assume full ownership over the machine resources used to execute compiled queries. For example, Hadoop (version one) supports the MapReduce programming model, which is used to express jobs that execute a map step followed by an optional reduce step. Each step is carried out by some number of parallel tasks. The Hadoop runtime is built on a single master (the JobTracker) that schedules map and reduce tasks on a set of workers (TaskTrackers) that expose fixed-sized task “slots”. This design leads to three key problems in Hadoop:

  1. The resources tied to a TaskTracker are provisioned for MapReduce only.
  2. Clients must speak some form of MapReduce in order to make use of cluster resources, and in turn, gain compute access to the data that lives there.
  3. Poor cluster utilization, especially in the case of idle resources (slots) due to straggler tasks.

With YARN (Hadoop version two), resource management has been decoupled from the MapReduce programming model in Hadoop, freeing cluster resources from slotted formats, and opening the door to programming frameworks beyond MapReduce. It is well understood that while enticingly simple and fault-tolerant, the MapReduce model is not ideal for many applications, especially iterative or recursive workloads like machine learning and graph processing, and those that tremendously benefit from main memory (as opposed to disk based) computation. A variety of big data systems stem from this insight: Microsoft’s Dryad, Apache Spark, Google’s Pregel, CMU’s GraphLab and UCI’s AsterixDB, to name a few. Each of these systems add unique capabilities, but form islands around key functionalities, making it hard to share both data and compute resources between them. YARN, and related resource managers, move us one step closer toward a unified Big Data system stack. The goal of REEF is to provide the next level of detail in this layering.


What is Tang?

Tang is a dependency injection Framework co-developed with REEF. It has extensive documentation which can be found here.


What is Wake?

Please refer to this section.


How can I get started?

Check out the REEF Tutorial and the Contributing page and join the community!