-
Notifications
You must be signed in to change notification settings - Fork 0
Sheepdog Design
The architecture of Sheepdog is fully symmetric; there is no central node such as a meta-data server. This design enables following features.
-
Linear scalability in performance and capacity
When more performance or capacity is needed, Sheepdog can be grown linearly by simply adding new machines to the cluster. -
No single point of failure
Even if a machine fails, the data is still accessible through other machines. -
Easy administration
There is no config file about cluster's role. When administrators launch the Sheepdog daemon at the newly added machine, Sheepdog automatically detects the added machine and begins to configure it as a member of the storage system.
Sheepdog is a distributed storage system and provides an object storage (something like simple key-value interface) to Sheepdog client (QEMU block driver). See the later sections for more details of each Sheepdog component.
-
Object storage
Sheepdog is not a general file system. The Sheepdog daemons create a distributed object storage system for QEMU (The daemon name of Sheepdog is ''sheep''). We can store ''object'' to the storage system. A object is flexible-sized data and has a globally unique identifier. We can do read/write/create/delete operations to objects by specifying its identifier. The object storage is composed of ''gateway'' and ''object manager''. -
Gateway
The gateway receives I/O requests (object id, offset, length, and operation type) from the QEMU block driver, calculates the target nodes based on the consistent hashing algorithm, and forwards the I/O requests to the target nodes. -
Object manager
The object manager receives the forwarded I/O requests from the gateway, and executes read/write operations to its local disk. -
Cluster manager
The cluster manager manages node membership (detection of failed/added nodes and notification of node membership changes) and some operations which requires consensus between all nodes (vdi creation, snapshoting vdi, etc). Currently, we uses corosync cluster engine for the cluster manager. -
QEMU block driver
The QEMU block driver divides a VM image into fixed-sized objects (4 MB by default) and store them on the object storage via its gateway.
Each object is identified by globally unique 64 bit integer, and replicated to multiple nodes. QEMU block driver doesn't care about where to store the objects. The object storage system is responsible for managing where to place objects.
Sheepdog objects are grouped into four types.
-
data object
This contains the actual data of virtual disk images. The virtual disk images are divided into fixed-sized data objects. Sheepdog client basically accesses this object. -
vdi object
This contains the metadata of virtual disk images (e.g. image name, disk size, creation time, data object IDs belonging to the vdi, etc) -
vmstate object
This stores the vm state image of running VM, which is used when the administrator takes live snapshot. -
vdi attr object
We can store attributes for each VDI. This object stores them. The attribute is key-value style, something like an extend attribute of a regular file.
- 0 - 31 (32 bits): object type specific space
- 32 - 55 (24 bits): vdi id
- 56 - 59 ( 4 bits): reserved
- 60 - 63 ( 4 bits): object type identifier
Each VDI has a globally unique ID (vdi id), which is calculated from the hash value of the VDI name. The usage of lower 32 bits is as follows:
object type | the usage of lower 32 bits |
---|---|
data object | the index number in the virtual disk image |
vdi object | not used (filled with zero) |
vmstate object | the index number in the vm state image |
vdi attr objects | the hash value of the key name |
- data object
the chunk of the virtual disk image - vdi object
struct sheepdog_inode { char name[SD_MAX_VDI_LEN]; /* the name of this VDI*/ char tag[SD_MAX_VDI_TAG_LEN]; /* the snapshot tag name */ uint64_t ctime; /* creation time of this VDI */ uint64_t snap_ctime; /* the time snapshot is taken */ uint64_t vm_clock_nsec; /* vm clock (used for live snapshot) */ uint64_t vdi_size; /* the size of VDI */ uint64_t vm_state_size; /* the size of vm state (used for live snapshot) */ uint16_t copy_policy; /* reserved */ uint8_t nr_copies; /* the number of object redundancy */ uint8_t block_size_shift; /* info about the size of the data object */ uint32_t snap_id; /* the snapshot id */ uint32_t vdi_id; /* the vdi id */ uint32_t parent_vdi_id; /* the parent snapshot vdi id of this VDI */ uint32_t child_vdi_id[MAX_CHILDREN]; /* the children VDIs of this VDI */ uint32_t data_vdi_id[MAX_DATA_OBJS]; /* the data object IDs this VDI contains*/ };
If snap_ctime is non-zero, the VDI is a snapshot. If the length of string name is zero, the VDI is deleted.
-
vmstate object
the chunk of the vm state image -
vdi attr object
The first SD_MAX_VDI_ATTR_KEY_LEN bytes (256 bytes) is the key name of this attribute. The rest of the object is the value of this attribute.
From the view of how to access objects, we can also categorize Sheepdog objects into two groups.
-
writable one VM can write and read the object but other VMs cannot access.
-
read-only object (e.g. data objects of a snapshot VDI) No VM can write the object but any VMs can read the object.
This means that virtual machines cannot share the same volume at the same time. This restriction avoids write-write conflicts and significantly simplify the implementation of sheepdog storage system.
Sheepdog object storage can accepts copy-on-write requests; when the clients send a create and write request, they can also specify the base object (the source of the CoW operation). This is used for snapshot and clone operations.
Sheepdog uses consistent hashing to decide where to store objects. Consistent hashing is a scheme that provides hash table functionality, and the addition or removal of nodes does not significantly change the mapping of objects. I/O load is balanced across the nodes by features of hash table.
Data replication of Sheepdog is simple. We assume that there is only one writer, so write collision cannot happen. Clients can send write request to the target nodes in parallel, and send read requests to one of the target nodes if clients order the I/O requests in themselves.
The gateway calculates the target nodes with consistent hashing, and sends write requests to all of the target nodes. The write request is successful only when all replicas can be successfully updated. It is because if one of the replicated objects are not updated, the gateway could read the old data from the not-updated object for the next time.
The gateway calculates the target nodes with consistent hashing, and sends a read request to one of the target nodes.
- fix object consistency
The consistency of the replication could be broken if the node crash during forwarding write I/O requests, so the gateway tries to fix the consistency when it reads object for the first time; read a whole object data from one of the target node and overwrite all replicas with it.
Sheepdog stores all node membership histories. We call the version number of the histories ''epoch'' (See also the 'Object Recovery' section). When the gateway forwards I/O requests to the target node and the latest epoch number doesn't match between the gateway and the target node, the I/O requests fail and the gateway retries the requests until the epoch numbers match. This is necessary to keep a strong consistency of replicated objects.
I/O retry can also happen when the target nodes are down and failed to complete I/O operations.
The object manager stores objects to the local disk. Currently, it stores each object as one file. This is quite simple. We could use other DBMS (like BerkeleyDB, Tokyo Cabinet, etc) as the object manager, but not supported now.
Objects are stored in the following path:
/store_dir/obj/[epoch number]/[object ID]
All object files also have an extended attribute sheepdog.copies, which specifies the number of the object redundancy.
When the sheep daemon fails during write operations, objects could be partially updated. Basically, this is not a problem because if the VM doesn't receive the success, there is no guarantee about the content of the written sectors. However, with regards to vdi objects, we must update in all or nothing way because if the vdi objects are updated partially, the metadata of VDI could broken. To avoid this problem, we use journal for write operations against vdi objects. The processes of journaling is quite simple.
- create a journal file "/store_dir/journal/[epoch]/[vdi object id]"
- write a data to the journal file first
- write a data to the vdi object
- remove the journal file
In most cases, Sheepdog clients can access their images independently because we do not allow for clients to access the same image at the same time. But some VDI operations (e.g. cloning VDI, creating VDI) must be done exclusively because the operations update global information. To implement this without central servers, we uses Corosync cluster engine.
In future, we'll extend Sheepdog to support other systems as the cluster manager.
(this section is under construction)
Sheepdog volumes are divided into 4 MB data objects. The object of newly created volume are not allocated at all, that is, only written objects are allocated.
First of all, QEMU block driver reads a vdi object from the Object storage through the gateway in bdrv_open().
The block driver calculates the data object id from the requested sector offset and size, and sends requests to the gateway. When the block driver sends write requests to the data object which is not belongs to the current VDI, the block driver sends CoW requests to allocate a new data object.
We can attach a snapshot VDI to the QEMU. When the block driver sends the write request to the snapshot VDI for the first time, the block driver creates a new writable VDI as a child of the snapshot, and sends requests against the new VDI.
When looking up the VDI object,
- calculate a vdi id from the hash value of the vdi name
- calculate a vdi object id from the vdi id
- send a read request to the vdi object
- if the vdi is not the requested one, increment the vdi id and retry to send a read request
Snapshot and cloning operation is quite simple:
- read a target VDI
- create a new VDI which has the same content as the target object VDI
- set the ''parent_vdi_id'' of the new VDI to the target VDI id
- set the ''child_vdi_id'' of the target VDI to the new VDI id
- set the ''snap_ctime'' of the target VDI to the current time then, the new vdi becomes the current vdi object.
When we create a snapshot of the running VM (live snapshot), we also stores the vm state info to the vmstate objects.
TODO: currently, reclaiming of unused data objects is not invoked until all relevant VDI objects (all relative snapshot VDIs and cloned VDIs) are deleted.
After all relevant VDIs are deleted, Sheepdog deletes all data objects of the VDIs, and set the null string to the name of the vdi objects.
Sheepdog stores histories of node membership in the stored directory. The path name is
/store_dir/epoch/[epoch number]
Each file contains the list of nodes info (IP address, port number, the number of virtual nodes) at the epoch.
- Receive all stored object IDs from all nodes
- Calculate which objects to have
- Create the object IDs list file "/store_dir/obj/[the current epoch]/list"
- Send a read requests to get objects whose ID is in the list file. The requests are sent to the node which had the object at the previous epoch.
- Store the object to the current epoch directory
If QEMU sends I/O requests against objects which are not recovered yet, Sheepdog blocks the requests and recovers the objects preferentially.
All requests of Sheepdog consist of the fixed-sized header part (48 bytes) and the flexible-sized data part. The header includes a protocol version, an operation code, an epoch number, a length of the data, etc.
operation code | description |
---|---|
SD_OP_CREATE_AND_WRITE_OBJ | Sends a request to create a new object and write data to it. If the object already exists, this operation fails. |
SD_OP_READ_OBJ | Sends a request to read data from a object. |
SD_OP_WRITE_OBJ | Sends a request to write data to a object. If the object does not exist, this operation fails |
SD_OP_NEW_VDI | Sends a vdi name to the object storage and create a new vdi object. This returns the unique vdi id of the vdi in the response. |
SD_OP_LOCK_VDI | Same as SD_OP_GET_VDI_INFO. |
SD_OP_RELEASE_VDI | Not used now. |
SD_OP_GET_VDI_INFO | Get information about the vdi (e.g. vdi id). |
SD_OP_READ_VDIS | Get the list of vdi IDs which are already used. |
operation code | description |
---|---|
SD_OP_DEL_VDI | Delete the requested VDI. |
SD_OP_GET_NODE_LIST | Get the list of sheepdog nodes. |
SD_OP_GET_VM_LIST | Not used now. |
SD_OP_MAKE_FS | Create a sheepdog cluster. |
SD_OP_SHUTDOWN | Stop a sheepdog cluster. |
SD_OP_STAT_SHEEP | Get information about local disk usage. |
SD_OP_STAT_CLUSTER | Get information about the sheepdog cluster |
SD_OP_KILL_NODE | Abort the sheep daemon. |
SD_OP_GET_VDI_ATTR | Get a vdi attr object id. |
operation code | description |
---|---|
SD_OP_REMOVE_OBJ | Removes the object. |
SD_OP_GET_OBJ_LIST | Get the list of object IDs which are stored on the target node. |