Documentation is still very much a work in progress.
Cluman is the server component of the Haven project. This component manages the clusters by communicating with the agent. The name comes from CLUster MANager.
- Cluman manages the following modules (data is stored in etcd; run etcdctl ls /cluman/ to check):
  - /nodes - list of connected nodes
  - /clusters - list of clusters (currently only RealCluster is stored here)
  - /containers - list of containers
  - /applications - list of compose applications; an application is made of one or more containers instantiated from a compose file
  - /docker-registry - list of user-added registries and Docker Hub
  - /pipelines - list of pipelines (to be implemented)
- It gathers metrics from the nodes (part of the information provided by the agent) and from the containers, and saves them in file queues in the ${dm.file.fbstorage} directory (default: ${java.io.tmpdir}/cluman/fbstorage).
- It provides an API (see http://$MASTER_IP:8761/swagger-ui.html) and a user interface for interacting with the platform.
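The default rule for the metrics storage directory can be expressed as a small resolution function. In this sketch the dm.file.fbstorage value is read from a JVM system property of the same name purely for illustration; Cluman may read it from its application configuration instead, so the lookup and the FbStorageDir class are assumptions.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class FbStorageDir {
    // Resolves the metrics file-queue directory: an explicit dm.file.fbstorage
    // value wins, otherwise fall back to ${java.io.tmpdir}/cluman/fbstorage.
    static Path resolve(String configured, String tmpDir) {
        if (configured != null && !configured.isBlank()) {
            return Paths.get(configured);
        }
        return Paths.get(tmpDir, "cluman", "fbstorage");
    }

    public static void main(String[] args) {
        // Assumption: the setting is read from a system property only for this illustration.
        Path dir = resolve(System.getProperty("dm.file.fbstorage"),
                System.getProperty("java.io.tmpdir"));
        System.out.println(dir);
    }
}
```

Treating a blank value like a missing one is a design choice of the sketch, not documented behavior.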
Cluman has the following entities:
- NodesGroup - a group of nodes. A NodesGroup has 'features' which may be used to resolve it into one of these group types:
  - SWARM - nodes in a group of this type are grouped together by a single 'swarm' service. We consider groups with this feature a 'cluster' or 'real cluster'.
  - FORBID_NODE_ADDITION - a group of this type is a meta group created by the system (for example, the "orphans" group mentioned below). No modification is allowed for such NodesGroups.
  In addition, Cluman has some predefined NodesGroups (all of them are stored in DiscoveryStorage.SYSTEM_GROUPS):
  - DiscoveryStorage.GROUP_ID_ALL - this group contains all currently online nodes.
  - DiscoveryStorage.GROUP_ID_ORPHANS - this group contains all nodes that do not belong to any RealCluster.
- RealCluster - a type of NodesGroup supported as a SWARM service. Do not confuse it with the term 'real cluster': RealCluster is the type name of the domain object that represents a 'real cluster' in Cluman.
- Node - a node of a cluster. Cluman does not distinguish the node that runs the cluster-manager container from other nodes.
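The feature rules for NodesGroup can be summarized in code. In the sketch below, the Feature enum mirrors the SWARM and FORBID_NODE_ADDITION names from the text, while GroupKind, kindOf and canAddNode are hypothetical illustrations rather than Cluman's API. Checking FORBID_NODE_ADDITION before SWARM is also an assumption of the sketch.

```java
import java.util.EnumSet;
import java.util.Set;

public class GroupFeatures {
    // Feature names taken from the description above.
    enum Feature { SWARM, FORBID_NODE_ADDITION }

    // Hypothetical classification used only for this illustration.
    enum GroupKind { REAL_CLUSTER, SYSTEM_META_GROUP, PLAIN_GROUP }

    static GroupKind kindOf(Set<Feature> features) {
        if (features.contains(Feature.FORBID_NODE_ADDITION)) {
            // Meta groups such as "orphans" are created by the system and must not be modified.
            return GroupKind.SYSTEM_META_GROUP;
        }
        if (features.contains(Feature.SWARM)) {
            // Nodes are joined by a single swarm service: a "real cluster".
            return GroupKind.REAL_CLUSTER;
        }
        return GroupKind.PLAIN_GROUP;
    }

    static boolean canAddNode(Set<Feature> features) {
        return !features.contains(Feature.FORBID_NODE_ADDITION);
    }

    public static void main(String[] args) {
        System.out.println(kindOf(EnumSet.of(Feature.SWARM)));                  // REAL_CLUSTER
        System.out.println(canAddNode(EnumSet.of(Feature.FORBID_NODE_ADDITION))); // false
    }
}
```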
When Cluman starts, it:
- Sets up a DiscoveryStorageImpl.
- Reads the list of registered RealClusters from the etcd storage and loads them.
- For each cluster, runs a swarm instance through DockerServices.getOrCreateCluster().
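The startup loop can be pictured roughly as follows. This is a simplified sketch, not the real DiscoveryStorageImpl or DockerServices code: the in-memory list stands in for the RealCluster records read from etcd, and the getOrCreateCluster method here only illustrates the get-or-create idea using ConcurrentHashMap.computeIfAbsent.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class StartupSketch {
    // Hypothetical stand-in for a running swarm instance of one cluster.
    record SwarmInstance(String cluster) { }

    private final Map<String, SwarmInstance> swarms = new ConcurrentHashMap<>();
    final AtomicInteger started = new AtomicInteger();

    // Get-or-create: a swarm instance is started only once per cluster name.
    SwarmInstance getOrCreateCluster(String name) {
        return swarms.computeIfAbsent(name, n -> {
            started.incrementAndGet();
            return new SwarmInstance(n);
        });
    }

    // Mirrors the startup steps: take the registered clusters and ensure a swarm for each.
    void start(List<String> registeredClusters) {
        for (String cluster : registeredClusters) {
            getOrCreateCluster(cluster);
        }
    }

    public static void main(String[] args) {
        StartupSketch s = new StartupSketch();
        // Assumption: hypothetical cluster names, as if loaded from etcd.
        s.start(List.of("dev", "prod"));
        s.start(List.of("dev"));               // "dev" is not started a second time
        System.out.println(s.started.get());  // 2
    }
}
```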
Docker Swarm is used for managing NodesGroups; the Docker services are wrapped in DockerServiceImpl (for a real cluster) and VirtualDockerService.
To gather node events from the Docker services, Cluman connects to each node directly. These connections are stored in DockerServices. Information about registered nodes is stored in NodeStorage.
When a node agent sends data to NodeStorage through TokenDiscoveryServer, the storage adds an additional reference to the node into the Swarm part of the etcd tree (via NodeStorage.updateSwarmRegistration).
Cluman registers a node through its agent, but also uses information about the node from docker info. All gathered information is saved in NodeRegistrationImpl. Data about node health and metrics is published as NodeMetrics.
A node has two main flags: health.healthy and on. The difference between them is:
- on - shows whether the node is online. The value is true while the node agent sends an ack within a specified time; if the timeout is exceeded, the node is immediately set to off (on=false). See NodeRegistrationImpl.isOn for details. This flag ignores the status of the node's Docker service.
- healthy - its value is derived from the status of the node's Docker service (for example, when Docker reports that the engine is unreachable), but an analysis of node metrics may also be used (for example, exceeded storage space or HDD SMART errors).
Note that node.health (aka NodeMetrics) carries a timestamp based on the node's local clock.
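The independence of the two flags can be illustrated with a small model. The sketch assumes a fixed ack timeout and keeps on and healthy separate, as described above; the class, field and method names are hypothetical and do not reproduce NodeRegistrationImpl.isOn. It also uses a single clock for simplicity, which ignores the local-time caveat in the note above.

```java
import java.time.Duration;
import java.time.Instant;

public class NodeStatusSketch {
    private final Duration ackTimeout;
    private Instant lastAck;   // time of the last agent ack, null if none yet
    private boolean healthy;   // derived from the Docker service status

    NodeStatusSketch(Duration ackTimeout) {
        this.ackTimeout = ackTimeout;
    }

    void onAgentAck(Instant now) {
        lastAck = now;
    }

    void onDockerStatus(boolean engineReachable) {
        healthy = engineReachable;
    }

    // "on" depends only on agent acks: it becomes false as soon as the timeout
    // is exceeded, regardless of the Docker service status.
    boolean isOn(Instant now) {
        return lastAck != null && Duration.between(lastAck, now).compareTo(ackTimeout) <= 0;
    }

    boolean isHealthy() {
        return healthy;
    }

    public static void main(String[] args) {
        NodeStatusSketch node = new NodeStatusSketch(Duration.ofSeconds(30));
        Instant t0 = Instant.parse("2024-01-01T00:00:00Z");
        node.onAgentAck(t0);
        node.onDockerStatus(false); // engine unreachable
        System.out.println(node.isOn(t0.plusSeconds(10)) + " " + node.isHealthy()); // true false
        System.out.println(node.isOn(t0.plusSeconds(31)));                        // false
    }
}
```

A node can therefore be on but not healthy (agent alive, Docker engine down).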
Cluman uses options from different sources to create a new container:
- API
- Compose-like (yml) or properties files from git, for example containers-configuration, where dev is the cluster name. The git repository is set with the following properties:
  - dm.image.configuration.git.url=https://.git
  - dm.image.configuration.git.username=
  - dm.image.configuration.git.password=
- Image labels with the arg. prefix, for example:
  LABEL arg.memory=512M
  LABEL arg.restart=always
  LABEL arg.ports=8761:8761
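Image labels with the arg. prefix map naturally to container options. The sketch below strips the prefix and collects the remaining keys into a map; the LabelArgs class is hypothetical, it only illustrates the naming convention, and how Cluman merges these values with options from the API or git is not shown.

```java
import java.util.Map;
import java.util.TreeMap;

public class LabelArgs {
    static final String PREFIX = "arg.";

    // Extracts container options from image labels, e.g. "arg.memory" -> "memory".
    static Map<String, String> fromLabels(Map<String, String> labels) {
        Map<String, String> options = new TreeMap<>();
        for (Map.Entry<String, String> e : labels.entrySet()) {
            String key = e.getKey();
            if (key.startsWith(PREFIX) && key.length() > PREFIX.length()) {
                options.put(key.substring(PREFIX.length()), e.getValue());
            }
        }
        return options;
    }

    public static void main(String[] args) {
        Map<String, String> labels = Map.of(
                "arg.memory", "512M",
                "arg.restart", "always",
                "arg.ports", "8761:8761",
                "maintainer", "someone"); // ignored: no arg. prefix
        System.out.println(fromLabels(labels)); // {memory=512M, ports=8761:8761, restart=always}
    }
}
```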
Applications use Docker Compose as the backend. Each application contains:
- String name
- String cluster
- File initFile
- Date creatingDate
- List containers
TODO: More details
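The application fields listed above can be expressed as a plain Java record. This is only a sketch: the text does not state the element type of containers, so List<String> (container names or IDs) is an assumption, as is the defensive copy of the list.

```java
import java.io.File;
import java.util.Date;
import java.util.List;

public class ApplicationSketch {
    // Fields as listed above; List<String> for containers is an assumption.
    record Application(String name, String cluster, File initFile,
                       Date creatingDate, List<String> containers) {
        Application {
            // Copy the list so later changes to the caller's list do not affect the record.
            containers = List.copyOf(containers);
        }
    }

    public static void main(String[] args) {
        Application app = new Application("web", "dev", new File("docker-compose.yml"),
                new Date(), List.of("web_1", "db_1"));
        System.out.println(app.name() + " on " + app.cluster() + ": " + app.containers());
    }
}
```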
Cluman has global instances of MessageBus. Each instance has a unique ID, which can usually be obtained from a static field of the event class: <EventClass>.BUS.
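The convention of exposing a bus ID as a static BUS field on the event class can be sketched with a tiny in-process bus. MiniBus, NodeEvent and the registry map are simplified hypothetical stand-ins, not Cluman's MessageBus API; only the ID string bus.cluman.node comes from this document.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

public class BusSketch {
    // Hypothetical minimal bus: every subscriber receives every published message.
    static final class MiniBus<T> {
        private final String id;
        private final List<Consumer<T>> subscribers = new CopyOnWriteArrayList<>();

        MiniBus(String id) { this.id = id; }

        String getId() { return id; }

        void subscribe(Consumer<T> listener) { subscribers.add(listener); }

        void accept(T message) {
            for (Consumer<T> s : subscribers) {
                s.accept(message);
            }
        }
    }

    // Hypothetical event class exposing its bus ID through a static BUS field.
    record NodeEvent(String node, String action) {
        static final String BUS = "bus.cluman.node";
    }

    // Global registry of buses keyed by their unique ID.
    static final Map<String, MiniBus<?>> BUSES = new ConcurrentHashMap<>();

    @SuppressWarnings("unchecked")
    static <T> MiniBus<T> bus(String id) {
        return (MiniBus<T>) BUSES.computeIfAbsent(id, k -> new MiniBus<Object>(k));
    }

    public static void main(String[] args) {
        MiniBus<NodeEvent> nodes = bus(NodeEvent.BUS);
        nodes.subscribe(e -> System.out.println(e.node() + " -> " + e.action()));
        nodes.accept(new NodeEvent("node-1", "online")); // prints "node-1 -> online"
    }
}
```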
List of global buses:
- bus.cluman.dockerservice - DockerServiceEvent, notifies DockerServiceInfo events.
- bus.cluman.log.application - ApplicationEvent, notifies 'applications' events.
- bus.cluman.log.registry - RegistryEvent, notifies registry addition and deletion events.
- bus.cluman.node - NodeEvent, notifies node status updates, which are derived from haven-agent requests.
- bus.cluman.log.docker - DockerLogEvent, notifies proxied events from the Docker service; see DockerServices.convertToLogEvent.
- bus.cluman.log.nodesGroup - NodesGroupEvent, notifies NodesGroup creations and deletions.
- bus.cluman.erorrs - LogEvent, aggregates messages from other buses with WithSeverity.getSeverity() >= WARNING; also keeps a history.
- bus.cluman.pipeline - PipelineEvent, notifies pipeline changes.
- bus.cluman.job - JobEvent, notifies changes of each JobInstance and contains JobInfo; can be caused by JobInstance.send().
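The bus.cluman.erorrs bus is described as aggregating messages with a severity of at least WARNING and keeping a history. A minimal sketch of that filtering follows; the bounded in-memory history, its size limit and all class names are assumptions, and the sketch is not thread-safe.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class ErrorAggregatorSketch {
    // Severity levels named in this document, declared in ascending order.
    enum Severity { INFO, WARNING, ERROR }

    record LogEvent(String source, Severity severity, String message) { }

    private final int maxHistory;
    private final Deque<LogEvent> history = new ArrayDeque<>();

    ErrorAggregatorSketch(int maxHistory) {
        this.maxHistory = maxHistory;
    }

    // Called for messages from other buses; keeps only WARNING and above.
    void onEvent(LogEvent event) {
        if (event.severity().compareTo(Severity.WARNING) < 0) {
            return;
        }
        history.addLast(event);
        if (history.size() > maxHistory) {
            history.removeFirst(); // drop the oldest entry
        }
    }

    List<LogEvent> history() {
        return new ArrayList<>(history);
    }

    public static void main(String[] args) {
        ErrorAggregatorSketch errors = new ErrorAggregatorSketch(100);
        errors.onEvent(new LogEvent("bus.cluman.node", Severity.INFO, "node online"));
        errors.onEvent(new LogEvent("bus.cluman.log.docker", Severity.ERROR, "container died"));
        System.out.println(errors.history().size()); // 1
    }
}
```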
Many events have an action field, whose values are described in the StandardActions class:
- create - an object will be created. Note that not every object supports it: for example, a DockerService cannot be created (create), but it can be started (start).
- update
- delete
- start - applicable to objects that have a run state: containers, jobs, processes
- stop
- die - an unexpected stop; usually it means an error
- online - this and the following action apply to objects accessed over the network: nodes, Docker services, etc.
- offline
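The action values above can be mirrored by a small enum. The sketch assumes the values appear in events as the lowercase strings shown in the list, and the grouping helpers are hypothetical conveniences derived from the descriptions; it does not reproduce the real StandardActions class.

```java
import java.util.Locale;

public class ActionsSketch {
    // Hypothetical enum mirroring the action values listed above.
    enum Action {
        CREATE, UPDATE, DELETE, START, STOP, DIE, ONLINE, OFFLINE;

        // Assumption: actions appear in events as lowercase strings, e.g. "die".
        String value() {
            return name().toLowerCase(Locale.ROOT);
        }

        static Action fromValue(String value) {
            return valueOf(value.toUpperCase(Locale.ROOT));
        }

        // start, stop and die concern objects with a run state (containers, jobs, processes).
        boolean isRunState() {
            return this == START || this == STOP || this == DIE;
        }

        // online and offline concern objects used over the network (nodes, Docker services).
        boolean isNetwork() {
            return this == ONLINE || this == OFFLINE;
        }
    }

    public static void main(String[] args) {
        Action a = Action.fromValue("die");
        System.out.println(a.value() + " run-state=" + a.isRunState()); // die run-state=true
    }
}
```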
Events also have a severity field, which can be INFO, WARNING, or ERROR.
The API is published at http://$MASTER_IP:8761/swagger-ui.html.