minor text fixes #21

Open · wants to merge 7 commits into master
34 changes: 15 additions & 19 deletions docs/config.md
# SIMOORG - CONFIG FILES
This document provides a quick overview of the various configuration files currently used by Simoorg. Simoorg expects the path to the config directory as its first console argument, and a standard Simoorg config directory should have the following structure:
```
configs/
    ...
```
Next we will go through each of these configurations in detail.


## API CONFIG api.yaml
This is our main API config file; it needs to be passed to both the Moirai process and the API server. It is a YAML file, mainly used to store the input named-pipe location for the Moirai process. It may hold more config items in the future as the API functionality is extended.


```
moirai_input_fifo: '/tmp/moirai.fifo'

```

## FATE BOOKS fate_books/*
A Fate Book is a collection of configurations used to describe the failures to be induced against your service. Each service should have a unique Fate Book associated with it. Upon starting up, Simoorg scans the configs/fate_books subdirectory for files with a .yaml extension. Each qualified file is treated as a Fate Book and used to instantiate observers that watch for and execute failures based on the conditions defined in that Fate Book.
Fate Books are human readable and can be edited using a conventional editor.

### Fate Book Format
The format of the Fate Books is YAML, chosen for its simplicity while still being capable of formally describing nested objects in a human-readable form.

### Fate Book Contents
Each service that needs to receive failure commands from the Failure Inducer has to have a Fate Book associated with it. Below is a sample Fate Book for an example service (called test-service):

```yaml
...
failures:
  ...
```

### Fate Book Sections
Next we take a closer look at the various sections of the Fate Book.


#### service:
Required : Yes
Default: None
The value of the service key is used to uniquely identify the service specified in that Fate Book. Simoorg enforces that no two Fate Books can have the same value for the service key.
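
For illustration, this section can be as small as a single line naming the example service used throughout this document:

```yaml
service: test-service
```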

#### topology:
Required : Yes
All values related to the topology plugin should be stored under this section. We expect only two keys under this section, as follows:

topology_plugin :
The name of the topology plugin should be the same as the plugin class (please check the plugins document).
topology_config :
Any plugin-specific values should be added to this section. Simoorg expects the config to be contained inside the main config directory, and the path provided here is relative to the config root.
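
A minimal sketch of the whole section, assuming the StaticTopology plugin and the sample config path shown later in this document:

```yaml
topology:
  topology_plugin: StaticTopology
  topology_config: plugins/topology/static/topo.yaml  # relative to the config root
```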

#### logger
Required : Yes

Contains the logging-related information; we expect it to contain the following keys:
...
This key is used to enable console logging
log_level :
Simoorg expects the value for this key to be "WARNING", "INFO", "VERBOSE" or "DEBUG"
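
A hedged sketch of this section (the console-logging key name is an assumption for illustration):

```yaml
logger:
  console_logging: True  # key name assumed
  log_level: INFO
```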

#### healthcheck
Required : Yes

In this section we list all of our health-check-related configs. The various keys we expect in this section are as follows:
...
script :
Depends on what plugin you use. In the case of DefaultHealthCheck this is the absolute path to the health check script.
plugin_config :
Place to specify any plugin-specific configurations. Currently this is None for the default health check plugin.
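
A hedged sketch of this section, assuming the default health check plugin (the plugin key name and the script path are illustrative assumptions):

```yaml
healthcheck:
  plugin: DefaultHealthCheck               # key name assumed
  script: /usr/local/bin/my_healthcheck.sh # hypothetical script path
  plugin_config: None
```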

#### destiny
Required : Yes

This section is responsible for listing all the scheduler-specific information. We expect the following keys to be present under the destiny section:
Key name | Description | Mandatory | Default |
--- | --- | --- | --- |
scheduler_plugin | the name of the scheduler plugin | Yes | None |

Please check the plugins document to better understand the plugin names. In addition to the keys listed above, the "scheduler_plugin" key can also contain any plugin-specific config. The failure name given in "scheduler_plugin"->failures->"failure_name" should have a valid failure definition in the failures section of the Fate Book.
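
A hedged sketch of this section, following the arrow notation above; the exact nesting and any plugin-specific options are assumptions:

```yaml
destiny:
  scheduler_plugin:
    NonDeterministicScheduler:        # nesting assumed from scheduler_plugin->failures->failure_name
      failures:
        - failure_name: graceful_stop # must match a definition in the failures section
```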

#### failures
This section includes a list of failure definitions, and each item in the list should contain the following keys:

Key name | Description | Mandatory | Default |
--- | --- | --- | --- |
restor_handler->args | The args passed to the handler during failure revert | Yes | None |
wait_seconds | The wait in seconds between failure induction and failure revert | Yes | None |
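
As a hedged illustration of a single failure definition (only the two keys shown above are confirmed; the remaining names are assumptions):

```yaml
failures:
  - name: graceful_stop      # key name assumed
    restor_handler:
      args: ['test-service'] # args passed to the handler during revert
    wait_seconds: 120        # wait between induction and revert
```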


### Plugin Configs
These are config files that may be specific to some plugin. Since these configs are closely related to the plugins, we will mainly be covering configs for the plugins that are shipped out of the box.

#### Handler Configs

For any handler plugin (let's assume the handler name is test_handler), we expect the config to be located at the path config/plugins/handler/test_handler/test_handler.yaml; the config contents depend greatly on the specific handler. The ShellScriptHandler plugin file, for example, looks like this:

```
host_key_path: ~/.ssh/known_hosts
```

#### Topology Configs
The location of the topology plugin config is usually provided under the topology section of the Fate Book. Again, the content of this configuration file depends heavily on the specific plugin, but here are two sample configuration files for the StaticTopology and KafkaTopology plugins respectively. In StaticTopology we list all the servers present in the service under the key node.
```
# file: configs/plugins/topology/static/topo.yaml
node:
  ...

# file: configs/plugins/topology/kafka/kafka_topo.yaml  (second sample; path assumed)
kafka_host_resolution:
  node_type_1:                      # hypothetical node type name
    LEADER: {Topic: "Topic1"}

```



22 changes: 11 additions & 11 deletions docs/design.md
# SIMOORG - HIGH LEVEL DESIGN
This document describes the high-level design of Simoorg: LinkedIn's Failure Inducing Framework. The rationale behind developing Simoorg is to have a simple yet powerful and extensible failure inducing framework. Simoorg is written in Python - LinkedIn's lingua franca for solving operational challenges.


Key points of Simoorg are:
* Comprehensive logging to help SREs and developers to get valuable insights about how their application of choice reacts to failures.
* Support of heterogeneous infrastructure by introducing flexible execution handlers. New execution handlers are easy to plug in with minimal efforts.

## From a bird's eye view

Simoorg's main job is to induce and revert failures against a service of your choice. The failures are induced based on the scheduler plugin type you wish to use. Simoorg comes with a non-deterministic scheduler configured, which generates failures at random times. Although the failures are generated at random times, you can still set a few limitations, such as the total run duration and the min/max gap between failures. Each failure is followed by a revert, ensuring that the cluster we operate against is brought back to a clean state. Simoorg logs important metrics like the failure name, impact and the time of the impact to help SREs and developers reason about the fault tolerance of their application of choice.

In the subsequent paragraphs we will cover the important components and talk about how they interact with each other:

* Moirai
* Atropos
* Scheduler
* Handler
* Journal
* Logger
* HealthCheck
* Topology
* Api Server

### Moirai

Moirai is a single-threaded process that monitors and manages individual Atropos instances using standard UNIX IPC mechanisms and Python queues. It also provides entry points for the Api Server to retrieve information about the various services being tested. Moirai takes the configs directory path as an input argument and bootstraps the framework by reading the configuration files in the configs directory. The configs directory contains:

* the main API config (api.yaml)
* a Fate Book per service (fate_books/*)
* plugin-specific configs

Here each Atropos instance can communicate specific information to Moirai with the help of Python queues.
![High level Design](/docs/images/high_level.jpg)


### Atropos

Upon initialization, each Atropos instance reads one [Fate Book](/docs/config.md) and, depending on the destiny defined in the Fate Book, sleeps until the requirements are met. Once the requirements are met, Atropos induces a random failure, waits for the specified interval and reverts it to bring the cluster back to a clean state. There are two types of requirements to be met before inducing a failure:

Each Atropos instance has its own instance of a Scheduler, which is in charge of generating the failure plan.
Apart from the Scheduler, each Atropos instance has its own instance of a Handler, Logger and Journal. The high level diagram reflecting Atropos and its components is as follows:
![Atropos Components](/docs/images/atr1.png)

### Scheduler

A Scheduler generates a failure plan and keeps track of time. Currently Simoorg ships only with a Non-deterministic scheduler. The Non-deterministic scheduler randomly generates dispatch times and associates them with random failures. We refer to this sequence of timestamp and failures internally as a Plan. Once generated, the Plan is passed to Atropos.
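
As a concrete illustration of this contract, the plugins document describes a Plan as a list of single-item dictionaries mapping a failure name to its trigger time; a hypothetical Plan could look like this (names and times are illustrative only):

```python
# Hypothetical Plan handed to Atropos: each entry maps one failure name
# to its trigger time (epoch seconds).
plan = [
    {"graceful_stop": 1467106000},
    {"simulate_full_gc": 1467109600},
]
```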


### Handler

Each failure definition should have a handler associated with it. A Handler is referred to by its name within a failure definition and is responsible for inducing and reverting failures. The table below lists supported handlers and handlers planned to be available in future:

Handler | Mechanism | Status | ETA |
--- | --- | --- | --- |
AWS | AWS API calls | not supported | TBD |
Rackspace | Rackspace API | not supported | TBD |


### Journal

Each Observer has a separate Journal instance. The Journal is responsible for:

* Keeping track of the internal state of Atropos such as: current impact and impact limits
* Persisting the current state of Atropos to support session resumption
* Resuming state after a crash

### Logger

Each Atropos has a separate Logger instance. The Logger is used to log and store arbitrary messages emitted at various points of Plan execution.

### HealthCheck

Healthcheck is an optional component that allows you to control the damage inflicted on your service. If enabled, Atropos kicks off the healthcheck logic defined in the Fate Book before inducing a failure. The Healthcheck component needs to return success in order for the failure to run; otherwise the Scheduler skips the current failure cycle. This ensures that we are not aggravating any existing issues and lets the cluster fire self-healing routines and recover. If a healthcheck is not defined, failures will be induced as scheduled, assuming the cluster was able to recover.

The best practice is to leverage your current monitoring system to identify the health of your service.

We also ship a simple Kafka HealthCheck out of the box. This plugin considers a cluster to be healthy if the under-replicated partition count is zero for all the nodes in the cluster. The plugin also depends on the Kafka topology config file to get information about the cluster.

### Topology

The Topology component is responsible for identifying and keeping the list of nodes that constitute your service. In most cases this is just a list of servers present in your cluster. The Topology component is also responsible for choosing a random node from the list and handing it over to Atropos. We ship static topology and Kafka topology plugins with our source code.

Another example of a topology is the Kafka topology. It is a custom Topology component that can resolve Kafka-specific node types, such as:
* RANDOM_LEADER - Where the node is a leader for a random topic and a random partition
* LEADER - Where the node is a leader for a specific topic and a specific partition (if you skip the partition it randomly selects a partition)

### Api Server

Simoorg provides a simple API interface based on Flask. The API server communicates with the Moirai process through Linux FIFOs, so it is necessary that the Api Server is started on the same server as the Moirai process. The API endpoints currently supported by our system are:

2 changes: 1 addition & 1 deletion docs/index.md
# SIMOORG
4 changes: 2 additions & 2 deletions docs/low_level.md
# LOW LEVEL FAILURES
Libfiu provides an easy way to induce low-level failures into any POSIX call in your application. To be able to use low-level failures against POSIX calls, we require the application to be started under the control of libfiu. The best practice is to use these failures either on your staging/dev clusters or on select nodes from your production cluster.

Please check the [libfiu website](https://blitiri.com.ar/p/libfiu/) to understand how to build and install libfiu on your servers. Once the libfiu packages are installed, please restart your application under the control of libfiu. You can achieve this using the [fiu-run command](https://blitiri.com.ar/p/libfiu/doc/man-fiu-run.html); the command should look something like the following:
```
fiu-run -x -c $COMMAND
```
12 changes: 6 additions & 6 deletions docs/plugins.md
# How to create a new plugin:
In Simoorg, we have four types of pluggable components, namely Topology, Healthcheck, Scheduler and Handler. Even though we ship a few standard plugins in each category, we understand that they will not meet the requirements of all potential users, so one of our guiding design principles has been to ensure that the system is easily extensible. In this document, we detail the various steps to be taken to create a new plugin.

## Topology
First we start with the topology plugin. Simoorg relies on the topology plugin to retrieve information about the individual nodes of a service. The arguments that are passed to any topology plugin are:
*Args:*
input_file - the config file to be read by the plugin
...

```
kafka_host_resolution:
  node_type_7:
    RANDOM_BROKER: {Topic: "Topic3"}

```
* This class reads the config file and loads it into an in-memory data structure. At the time of failure induction, it returns a random host (broker host name) to the caller method. The selection of this host depends upon the kind of node selected.
Expand All @@ -67,7 +67,7 @@ In the above Kafka Topology plugin example, it is possible to modify the config
Path to KafkaTopology plugin : simoorg.plugins.topology.KafkaTopology.KafkaTopology
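
As a hedged sketch of a custom topology plugin (the class layout and the get_random_node method name are assumptions; only the input_file argument and the random-node behaviour are described in this document):

```python
import random
import yaml


class ExampleStaticTopology(object):
    """Sketch of a topology plugin: loads nodes from a config file and
    hands back a random one at failure-induction time."""

    def __init__(self, input_file):
        # The 'node' key follows the StaticTopology config described in config.md
        with open(input_file) as conf:
            self.nodes = yaml.safe_load(conf)['node']

    def get_random_node(self):  # method name assumed
        return random.choice(self.nodes)
```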


## HealthCheck :
The Healthcheck plugin is responsible for checking the health of the target cluster.
*Args:*
script - Any external script to be used by the plugin
Let's take the example of the *KafkaHealthCheck plugin*:

...

If users want to use a shell script that performs the health check on the target cluster, they can use the DefaultHealthCheck plugin in the Fate Book and pass it the customized shell script. The DefaultHealthCheck plugin, like the KafkaHealthCheck plugin, implements the check() method, which returns true if the target cluster is healthy and false otherwise.
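
A hedged sketch of such a plugin; the check() method and the script argument come from this document, while the zero-exit-status convention is an assumption:

```python
import subprocess


class ExampleHealthCheck(object):
    """Sketch of a healthcheck plugin that wraps an external script."""

    def __init__(self, script, plugin_config=None):
        self.script = script
        self.plugin_config = plugin_config

    def check(self):
        # Treat a zero exit status as "cluster healthy" (assumed convention)
        return subprocess.call([self.script]) == 0
```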

## Scheduler:
The Scheduler plugin is responsible for creating the plans that an Atropos process will follow. A plan, as received by Atropos, should be a list of single-item dictionaries, where each dictionary has the failure name as the key and the trigger time as the value.
*Args:*
destiny_object - A dictionary containing the contents of the plugin key of the destiny section
Let us consider the example of the NonDeterministicScheduler plugin:

...

There are a number of fully implemented methods in BaseScheduler that you can use in your implementation to better access the destiny object.
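
A minimal scheduler sketch following the plan contract described above; the get_plan method name and the key names inside destiny_object are assumptions, and BaseScheduler is not subclassed here for brevity:

```python
import time


class ExampleFixedGapScheduler(object):
    """Sketch of a scheduler plugin: emits a plan of single-item dicts
    {failure_name: trigger_time}."""

    def __init__(self, destiny_object):
        # Key names in destiny_object are assumptions for this sketch
        self.failures = destiny_object.get('failures', [])
        self.gap = destiny_object.get('gap_seconds', 600)

    def get_plan(self):  # method name assumed
        now = time.time()
        return [{name: now + (i + 1) * self.gap}
                for i, name in enumerate(self.failures)]
```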

## Handler
The Handler is the plugin responsible for actually inducing and reverting failures.
*Args:*
config_dir - This is the path to the simoorg config directory
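
A hedged handler sketch: per this document a handler induces and reverts failures and receives the config directory; the method names and signatures below are assumptions:

```python
class ExampleHandler(object):
    """Sketch of a handler plugin that acts on one node at a time."""

    def __init__(self, config_dir):
        self.config_dir = config_dir

    def induce_failure(self, failure_name, node, args):  # name assumed
        print("inducing %s on %s with args %r" % (failure_name, node, args))

    def revert_failure(self, failure_name, node, args):  # name assumed
        print("reverting %s on %s with args %r" % (failure_name, node, args))
```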
13 changes: 6 additions & 7 deletions docs/user_guide.md
# Introduction
This document describes the process of setting up and running Simoorg against an application cluster.
## Installation
The system requirements for Simoorg are as follows:

* OS: any Linux distribution
* Python version: Python 2.6
* Additional Python modules: multiprocessing, yaml, paramiko

Simoorg is currently distributed via pip, so to install the package please run the following command:
```
(sudo) pip install simoorg
```
If you want to work with the latest code, please run the following commands
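A typical flow, assuming the project's GitHub repository at github.com/linkedin/simoorg and the standard setuptools test step, would be:

```
# clone the repo and run the test suite (commands assumed)
git clone https://github.com/linkedin/simoorg.git
cd simoorg
python setup.py test
```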
Once you have confirmed that the tests have passed, you can install the code.
If you are planning to use the SSH handler plugin to induce failures against a specific service cluster, please ensure that the user you are using to run Simoorg has passwordless SSH access to all the nodes in the cluster. You should also ensure that any failure scripts you plan to use are already present on all the nodes in the target service cluster.

## Basic Usage
Simoorg is started using the command *simoorg* which takes the path to your config directory as the only argument. Please check the [config document](/docs/config.md) to better understand the configuration files. The sample config directory packaged with the product can be used to set up your configs.
```
Ex: simoorg ~/configs/
```

## Usage Example
In this section of the document, we describe how to use Simoorg against a Kafka cluster. For this example we will be running three predefined failures (graceful stop, ungraceful stop and simulate full GC) on random nodes in the cluster using the shell script handler plugin. We will be executing the failures in a random manner using the non-deterministic scheduler. We will also be using the Kafka Topology plugin and Kafka HealthCheck plugin. Both of these plugins are packaged with the product and are ready to use out of the box.

Before we start, we need to make sure that all the required failure scripts (the ones required for these failure scenarios are present in the repo under Simoorg/failure_scripts/base/) are present on all the broker nodes in the Kafka cluster. Let's assume that the scripts are present in the location ~/test/failure_scripts/base/ on the Kafka brokers; we will need this path later when we update our configurations.
...

Where ~/kafka_configs/ is the path to your failure inducer configs. You can then start the API server with the following command:
```
gunicorn 'simoorg.Api.MoiraiApiServer:create_app("~/kafka_configs/api.yaml")'
```
Where api.yaml should contain a valid path for the named pipe used by both the API server and Simoorg. Our current implementation of the API relies on the Simoorg process to retrieve all information and does not serve any data once the process is dead. Please check the design doc to better understand the various REST API endpoints.