Skip to content
This repository has been archived by the owner on Feb 20, 2023. It is now read-only.

Coding Guideline

Tianyu Li edited this page Aug 21, 2018 · 29 revisions

Directory Structure and Namespaces

In the project root directory, there are three major places where you will put code:

  • src -- This is where the bulk of the code for lives. Anything you expect to be compiled into the release should be here.
  • test -- This is where unit tests, benchmarks, and utility code for them lives. src should not have dependency going into test.
  • script -- Where scripts that support development and testing lives. (e.g. python formatting script, dependency installation).

Almost never will you need to create new directories outside of these. Here are guidelines for adding files to each.

src


There can be at most 2-levels of directories under src, the first level will be general system components (e.g. storage, execution, network, sql, common), and the second level will be either for a class of similar files, or for a self-contained sub-component.

Translated into coding guidelines, you should rarely need to create a new first-level subdirectory, and should probably consult Andy if you believe you do. To create a new secondary directory, make sure you meet the following criteria:

  • There are more than 2 (exclusive) files you need to put into this folder
  • Each file is stand-alone, i.e. either the contents don't make sense living in a single file, or that putting them in a single file makes the file large and difficult to navigate. (This is open to interpretation, but if, for example, you have 3 files containing 10-line class definitions, maybe they should not be spread out that much).

And one of the two:

  • The subdirectory is a self-contained sub-component. This probably means that the folder only has one outward facing API. A good rule of thumb is when outside code files only need to include one header from this folder, where said API is defined.
  • The subdirectory contains a logical grouping of files, and there are enough of them that leaving them ungrouped makes the upper level hard to navigate. (e.g. all the plans, all the common data structures, etc.) A good rule of thumb is if you have subdirectory As, you should be able to say with a straight face that everything under As is an A. (e.g. Everything under containers is a container)

Every class and/or function under these directories should be in namespaces. All code will be under namespace terrier, and namespace the same as their first-level directory name (e.g common, storage). Secondary sub-directories do not have associated namespaces.

test


The directory structure of the test folder should generally reflect the directory structure of the src folder, ignoring the include. Each test should be under the same path as the file they test, and named "XXX_test" (for example, src/include/common/container/bitmap.h will be tested under test/common/container/bitmap_test.cpp)

Generally, there can be no code sharing between tests since they are different build targets. There are cases, however, where it makes sense for tests to share some common utility function. In that case, you can write a utility file under test/util.

The test/util folder should have no sub-directories. In most cases, one test util file for every directory under src should suffice. (e.g. test/util/storage_test_util.h) Sometimes it will make sense to have a util file for stand-alone modules (e.g. test/include/util/random_test_util.h).

(TODO: Benchmarks, Nightly Tests)

script


TBD

Code Organization

We strive to write modern c++ code (c++ 17), but c++ is a language with a lot of legacy features from the 80s. Certain guidelines must be followed to make the use of language features tasteful and elegant.

.h and .cpp files


Surprisingly, there is no universal standards on what to call c++ code files. In this project, we will use .h for headers, inside the various /include directories, and .cpp for the implementation.

When possible, implementation should be separated between .h and .cpp files, as this will make the compilation process much faster and hassle-free. Documentation should be written in the .h files. There are a couple of exceptions to this rule:

  • One-liners or otherwise boilerplate code that is unlikely to change. What constitutes a one-liner is somewhat ambiguous and can be subjective, but > 5 lines would not pass the smell test.
  • Templates. The c++ compiler generates instantiations of actual code based on the template given, and may need this information in the compilation units themselves, thus requiring all definition to be present in the included header. There are two solutions to this: either write all of the template code in the header or explicitly instantiate templates. Because doing the latter is painful, we generally will write those definitions in-place.

Forward Declaration


When referring to some object only by reference, object or some template arguments, it is not necessary that the code knows its structure. As a result, we do not necessarily need to provide its complete declaration with members and methods (i.e. #include), but can get away with a hint that such a class exist. Example is given below:

class Foo;
...
void DoSomething(Foo *foo);  // compiles
void DoSomethingElse(Foo *foo) {
  foo->bar();  // error, member access into opaque type
}
...

Doing this saves re-compilation time. As a rule of thumb, forward declare when possible in the header, but always include the actual headers in the .cpp file.

The "Wall of Text"


A "Wall of Text" is a dense, unstructured piece of code with few or no paragraph breaks. Here is an example extracted from our old codebase, verbatim:

...
ResultType TestingSQLUtil::ExecuteSQLQuery(
    const std::string query, std::vector<ResultValue> &result,
    std::vector<FieldInfo> &tuple_descriptor, int &rows_changed,
    std::string &error_message) {
  LOG_TRACE("Query: %s", query.c_str());
  // prepareStatement
  std::string unnamed_statement = "unnamed";
  auto &peloton_parser = parser::PostgresParser::GetInstance();
  auto sql_stmt_list = peloton_parser.BuildParseTree(query);
  PELOTON_ASSERT(sql_stmt_list);
  if (!sql_stmt_list->is_valid) {
    return ResultType::FAILURE;
  }
  auto statement = traffic_cop_.PrepareStatement(unnamed_statement, query,
                                                 std::move(sql_stmt_list));
  if (statement.get() == nullptr) {
    traffic_cop_.setRowsAffected(0);
    rows_changed = 0;
    error_message = traffic_cop_.GetErrorMessage();
    return ResultType::FAILURE;
  }
  // ExecuteStatment
  std::vector<type::Value> param_values;
  bool unnamed = false;
  std::vector<int> result_format(statement->GetTupleDescriptor().size(), 0);
  // SetTrafficCopCounter();
  counter_.store(1);
  auto status = traffic_cop_.ExecuteStatement(statement, param_values, unnamed,
                                              nullptr, result_format, result);
  if (traffic_cop_.GetQueuing()) {
    ContinueAfterComplete();
    traffic_cop_.ExecuteStatementPlanGetResult();
    status = traffic_cop_.ExecuteStatementGetResult();
    traffic_cop_.SetQueuing(false);
  }
  if (status == ResultType::SUCCESS) {
    tuple_descriptor = statement->GetTupleDescriptor();
  }
  LOG_TRACE("Statement executed. Result: %s",
            ResultTypeToString(status).c_str());
  rows_changed = traffic_cop_.getRowsAffected();
  return status;
}
...

Needless to say, this is bad. You should not write one in our new codebase. Having dense code itself is not a problem, but the "Wall of Text" is usually a symptom of one of the following, more severe problems, ranked from least severe to most severe, with mitigation strategies:

  • Lack of meaningful logical blocks, comments, and readability features such as extra blank line between irrelevant statement groups. Solution: Write code in logical blocks, clearly separated by an extra blank line, and prefixed with comments.

  • Lack of abstraction in the code. Some of the code clutter may qualify for helper methods. Longer methods might have been broken up into several smaller ones. Solution: Avoid long methods. Actively seek out opportunities to abstract reusable code blocks out into functions.

  • Copy-and-pasted code. Solution: DON'T

Dependency Injection


(More TBD)

Ask us how we know, but Singletons, and hard-coded dependencies in general, do wonders to make a codebase hard to understand, hard to test, and hard to maintain.

The solution is to use a specific style of coding to avoid having to deal with this. Dependency Injection is a widely used paradigm in industry for this problem, and it is easier than it sounds.

Although we don't have the need for a full DI framework yet, we should strive to write code in a one amenable to future changes. There are certain exceptions to this: debug logging, and other similar tools for developers don't have to conform to this standard. In the general case though, all code should make all external dependencies explicit through DI-like programming style.

Read the article linked to above if you are interested, but otherwise, here is a quick example of how these things work. Suppose we want to write a LinkedList, but linked list nodes will be reused:

struct LinkedListNode {
 // ... 
};

class LinkedList {
  // ...
  template <typename T>
  void Add(T content) {
    // ...
    LinkedListNodeObjectPool::GetInstance().New(content);
  }
};

At this point, it is hiding the fact that it mutates the state of the global singleton LinkedListNodeObjectPool, and because we made an explicitly call to it, preventing modularized testing. Now consider a scenario where for some reason LinkedListNodeObjectPool hands out large chunks of memory (say, 1 MB each node), and we want to do scale testing on the operations of LinkedList (e.g. Insert, Delete concurrently), our test will have to run slowly because there is no way for us to change the LinkedListNodeObjectPool's memory behavior without changing the code. Whereas in an ideal world, we know that in this test, it doesn't matter what the content of these LinkedListNode are, and can get away with a fake object that just has the pointer field.

Suppose we have written the code instead in this way:

class LinkedList {
  LinkedList(LinkedListNodeObjectPool &pool) : pool_(pool) {}
  // ...
  template <typename T>
  void Add(T content) {
    // ...
    pool_.New(content);
  }

  //...
  LinkedListNodeObjectPool &pool_;
};

Everything still works. But now suppose we want to write the above test, we simply do:

// fake implementation with low memory overhead
class FakeLinkedListNodeObjectPool : public LinkedListNodeObjectPool {
  // ...
}

TEST(LinkedListTests, LargeTest) {
  FakeLinkedListNodeObjectPool fake_pool;
  LinkedList tested(fake_pool);
  // test on tested
}

When we execute the real program, presumably in main.cpp:

int main() {
  // ...
  LinkedListNodeObjectPool real_pool;
  LinkedList tested(real_pool);
  // ...
}

To sum it up, dependency injection states that object creation and the logic be separated. This allows us to change the object without changing the logic. Of course, in practice, there is no need to be as strict about this; it might make sense, for example, for objects to create and own objects when the owned object is essential to functionality (e.g. rarely do we want to change a std::vector implementation, so creating a vector member in the object is fine). However, if the object in question logically is a different component, and it might make sense for the logic to be tested apart from the object, write the code in this way.