
Cluster management: good practices

liuy edited this page Dec 11, 2012 · 5 revisions

Environment

Sheepdog often uses more file descriptors (FDs) than you might expect. The larger the cluster, the more FDs are in use at any given time, so make sure the FD limit is high enough:

$ ulimit -n

If you see '1024', which is the default value on most distributions, raise it as high as possible.
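As a minimal sketch of the check described above, the following script compares the current soft limit with a wanted value and tries to raise it for the current shell. The threshold `WANTED_FDS` and the helper name `fd_limit_ok` are assumptions made for illustration, not values recommended by Sheepdog.

```shell
#!/bin/bash
# Hypothetical threshold; pick a value that suits the size of your cluster.
WANTED_FDS=65536

# fd_limit_ok LIMIT WANTED
# Succeeds when LIMIT is "unlimited" or at least WANTED.
fd_limit_ok() {
    limit=$1
    wanted=$2
    [ "$limit" = "unlimited" ] && return 0
    [ "$limit" -ge "$wanted" ]
}

current=$(ulimit -n)
if fd_limit_ok "$current" "$WANTED_FDS"; then
    echo "fd limit $current is sufficient"
else
    echo "fd limit $current is below $WANTED_FDS, trying to raise it"
    # The soft limit can only be raised up to the hard limit (ulimit -Hn).
    ulimit -n "$WANTED_FDS" 2>/dev/null \
        || echo "could not raise the limit, check the hard limit (ulimit -Hn)"
fi
```

The new limit only applies to the current shell and the processes started from it, so the sheep daemon has to be started from the same shell (or the limit has to be set persistently by whatever mechanism your distribution provides).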

To help debug a Sheepdog crash, allow your systems to dump core files. Most importantly, before you blame Sheepdog, please paste the gdb output from the core file (by default the core file of sheep is located at /path/to/store/core), obtained with:

$ gdb /path/to/sheep /path/to/core_file
(gdb) bt full
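The same backtrace can be collected without an interactive session, which makes it easier to paste into a bug report. This is a sketch only: the function name `collect_backtrace` is hypothetical, and the paths are the placeholders used above. It relies on gdb's `-batch` and `-ex` options and on `ulimit -c` to lift the core size limit for the current shell.

```shell
#!/bin/bash
# Allow core files of any size for processes started from this shell.
ulimit -c unlimited 2>/dev/null \
    || echo "could not lift the core size limit" >&2

# collect_backtrace SHEEP_BINARY CORE_FILE
# Prints a full backtrace to stdout without opening the gdb console.
collect_backtrace() {
    sheep_bin=$1
    core_file=$2
    if [ ! -r "$sheep_bin" ] || [ ! -r "$core_file" ]; then
        echo "missing or unreadable binary or core file" >&2
        return 1
    fi
    gdb -batch -ex "bt full" "$sheep_bin" "$core_file"
}
```

For example, `collect_backtrace /path/to/sheep /path/to/store/core > sheep-backtrace.txt` writes the backtrace to a file that can be attached to a report.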

General remarks

Sheepdog uses a zone concept for data replication. In the default configuration, each node in the cluster is its own zone, and Sheepdog replicates each VDI object to as many zones as the configured number of copies (X).

To keep the cluster working, never kill sheeps in more than X-1 zones at a time, and never kill further sheeps while recovery is still in progress. If you kill sheeps in X or more zones at a time, you will have no access to some of the objects. It does not matter whether you kill complete zones or only a few sheeps in X different zones at a time: there will '''always''' be some objects that are held only by those specific sheeps. Only the number of inaccessible objects will differ.

Up to and including version 0.4 you will '''lose your data''' in this case! With the introduction of plain_store and the rework of farm to use it as its core in version 0.5, you won't lose data, but as long as the affected zones are not restarted, you can't access the objects that are available only in those zones, and your VM gets an I/O error.
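The X-1 rule can be written down as a small pre-flight check. This is an illustrative sketch: the function name `safe_to_stop` is hypothetical, and the check covers only the zone count. It cannot tell whether a recovery is still running, which has to be verified separately.

```shell
#!/bin/bash
# safe_to_stop COPIES ZONES_DOWN
# Succeeds when stopping sheeps in ZONES_DOWN zones at once still leaves
# at least one replica of every object, i.e. ZONES_DOWN <= COPIES - 1.
safe_to_stop() {
    copies=$1
    zones_down=$2
    [ "$zones_down" -le $((copies - 1)) ]
}

# Example with X = 3 copies.
for zones in 1 2 3; do
    if safe_to_stop 3 "$zones"; then
        echo "copies=3, zones down=$zones: ok"
    else
        echo "copies=3, zones down=$zones: some objects become inaccessible"
    fi
done
```

With three copies, stopping sheeps in one or two zones passes the check, while stopping sheeps in three zones does not.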

Make sure your UPS auto-shutdown scripts (e.g. the apcupsd config) do the correct job!

Upgrading the nodes (outdated, help is needed to work out a better process for latest sheep)

The upgrade scenario depends on whether you need the cluster running the whole time or can plan a complete shutdown for some time.

Without downtime

If you need to run the cluster all the time, you have to:

  • Kill the sheeps on one node, perform the update, and restart the sheeps.

  • Wait for recovery to complete, then proceed with the next node.

  • After finishing all nodes, run ''collie cluster cleanup''. This removes objects that are no longer needed on the nodes after successful recovery.
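The order of the steps above can be sketched as a dry run that prints the plan for a list of nodes. The node names and the four helper functions are hypothetical placeholders that only echo what they would do. Replace their bodies with whatever your site uses to stop, upgrade, and start sheep, and to wait for recovery. Only ''collie cluster cleanup'' is taken from this page.

```shell
#!/bin/bash
# Hypothetical node list; replace with the nodes of your cluster.
NODES="node1 node2 node3"

# Placeholder helpers: each one only prints the step it stands for.
stop_sheeps()       { echo "[$1] stop sheeps"; }
upgrade_sheepdog()  { echo "[$1] upgrade sheepdog"; }
start_sheeps()      { echo "[$1] start sheeps"; }
wait_for_recovery() { echo "[$1] wait until recovery has completed"; }

# rolling_upgrade NODE...
# Handles one node at a time and stops at the first failing step.
rolling_upgrade() {
    for node in "$@"; do
        stop_sheeps "$node" || return 1
        upgrade_sheepdog "$node" || return 1
        start_sheeps "$node" || return 1
        wait_for_recovery "$node" || return 1
    done
    # Only after every node is done: remove objects no longer needed.
    echo "collie cluster cleanup"
}

rolling_upgrade $NODES
```

Handling one node at a time keeps the number of zones that are down at once to one, which stays within the X-1 rule for any cluster with at least two copies.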

With downtime

If you have a time window in which to shut down the cluster completely:

  • Shut down all connected qemu instances first, then use ''collie cluster shutdown'' to stop all sheeps on all nodes. This leaves the cluster in a clean state.

  • Make the updates on all nodes.

  • Restart the sheeps. The cluster starts working again once all original inhabitants are back alive on the farm.

Migration