Skip to content
Yuzhen Huang edited this page Apr 24, 2019 · 3 revisions

Overview

At a high level, a Tangram application contains a user program written in C++, a Master program and a cluster of Worker program. To create a user application, a user writes a user program with the high-level Tangram library, compiles and links the user code with the worker program. The user program can then be submitted by a python launching script which will launch the Master and a cluster of Workers. The Master analyzes the DAG of MapUpdate plans and coordinates with the Workers to finish the program.

Collection

The main data abstraction Tangram provides are collection, which is a group of element partitioned across the nodes of the cluster that can be operated on in parallel. The collection in Tangram corresponds to the resilient distributed dataset (RDD) in Spark but the collection in Tangram can be mutable in a MapUpdate plan, which offers more generality.

Creating Collections

Collections in Tangram are created by starting with a file or a folder from the Hadoop file system, or a vector of element from the C++ user program. Besides, a placeholder collection (empty) can also be created.

// 1. Load a file from HDFS as a Tangram collection with each line being an element.
// User needs to provide a parse function to determine how 
// a line should be parsed into an object in the collection.
// The collection is of type Collection<String>.
auto c1 = Context::load("/path/to/file/in/hdfs", [](const std::string) { return s; });

// 2. Distribute a C++ vector to a Tangram collection with 2 partitions.
auto c2 = Context::distribute(
    std::vector<std::string>{"b", "a", "n", "a", "n", "a"}, 2);

// 3. Create a placeholder collection of type ObjT with 10 partitions.
auto c3 = Context::placeholder<ObjT>(10);

User-defined Object Type in Collection

Users can define their own object type to be stored in collection. For example, we can define a Vertex type with id and outlinks as internal fields:

struct Vertex {
  using KeyT = int;

  Vertex() = default;
  Vertex(KeyT id) : id(id) {}

  KeyT Key() const { return id; }

  // fields
  KeyT id;
  std::vector<int> outlinks;

  // serialization functions
  friend SArrayBinStream &operator<<(xyz::SArrayBinStream &stream,
                                     const Vertex &vertex) {
    stream << vertex.id << vertex.outlinks;
    return stream;
  }
  friend SArrayBinStream &operator>>(xyz::SArrayBinStream &stream,
                                     Vertex &vertex) {
    stream >> vertex.id >> vertex.outlinks;
    return stream;
  }
};

A few things that are worthy to mention:

  1. As we want the Vertex collection to be **indexable **(i.e., one can find a specific vertex by its key), the user-defined type should have a KeyT specifying the key type, and a Key() function which can return the key (i.e., the unique name of the object).
  2. We need to provide a constructor taking a key as the argument. This is because with this constructor, Tangram can automatically create a Vertex object if a given key does not exist.
  3. We need to provide the **serialization/deserialization function **so that the Vertex can be serialize/deserialize to/from disks or remote machines.

Tangram provides high-level dataflow-like APIs in C++.

Clone this wiki locally