diff --git a/docs/assets/images/Host-Aggregates.png b/docs/assets/images/Host-Aggregates.png new file mode 100644 index 00000000..b49d3af2 Binary files /dev/null and b/docs/assets/images/Host-Aggregates.png differ diff --git a/docs/assets/images/Multiple-Host-Aggregates.png b/docs/assets/images/Multiple-Host-Aggregates.png new file mode 100644 index 00000000..937893cf Binary files /dev/null and b/docs/assets/images/Multiple-Host-Aggregates.png differ diff --git a/docs/assets/images/OpenStack-Logo-Vertical.svg b/docs/assets/images/OpenStack-Logo-Vertical.svg new file mode 100644 index 00000000..003be44f --- /dev/null +++ b/docs/assets/images/OpenStack-Logo-Vertical.svg @@ -0,0 +1 @@ +OpenStack_Logo_Vertical diff --git a/docs/assets/images/kubernetes-stacked-color.svg b/docs/assets/images/kubernetes-stacked-color.svg new file mode 100644 index 00000000..a16970c1 --- /dev/null +++ b/docs/assets/images/kubernetes-stacked-color.svg @@ -0,0 +1 @@ + diff --git a/docs/assets/images/source/Aggregates.graffle b/docs/assets/images/source/Aggregates.graffle new file mode 100644 index 00000000..310e18ea Binary files /dev/null and b/docs/assets/images/source/Aggregates.graffle differ diff --git a/docs/deployment-guide-welcome.md b/docs/deployment-guide-welcome.md index f2da6b12..335574c1 100644 --- a/docs/deployment-guide-welcome.md +++ b/docs/deployment-guide-welcome.md @@ -1,4 +1,4 @@ -![Rackspace Cloud Software](assets/images/ospc_flex_logo_red.svg){ align=left : style="max-width:175px" } +![Rackspace Cloud Software](assets/images/ospc_flex_logo_red.svg){ align=left : style="max-width:350px" } # What is Genestack? @@ -12,5 +12,5 @@ Genestack’s inner workings are a blend of dark magic — crafted with [Kustomi platform is ready to go with batteries included. Genestack is making use of some homegrown solutions, community operators, and OpenStack-Helm. Everything -in Genestack comes together to form cloud in a new and exciting way; all built with opensource solutions +in Genestack comes together to form a cloud in a new and exciting way; all built with Open Source solutions to manage cloud infrastructure in the way you need it. diff --git a/docs/openstack-cloud-design-az.md b/docs/openstack-cloud-design-az.md index 56e1b5ef..2a1c0aa9 100644 --- a/docs/openstack-cloud-design-az.md +++ b/docs/openstack-cloud-design-az.md @@ -12,8 +12,8 @@ Availability Zones are a logical abstraction for partitioning a cloud without kn Typically, a [region](openstack-cloud-design-regions.md) encompasses at least two, ideally more, availability zones (AZs). All AZs are fully independent, featuring redundant power, cooling, and networking resources, and are interconnected within a region via dedicated high-bandwidth, low-latency links. Connectivity between AZs in-Region is typically extremely fast[^1]. -!!! Info - See [this page](https://www.rackspace.com/about/data-centers){:target="_blank"} on the Rackspace website for information on how Rackspace deploys data centers to deliver these capabilities. +!!! Genestack + See the [data center](https://www.rackspace.com/about/data-centers){:target="_blank"} pages on the Rackspace website for information on how Rackspace deploys and manages data centers to deliver these capabilities. There is no standard for quantifying distance between AZs. What constitutes "in-Region" is defined by the cloud provider, so when designing a cloud there is a lot of latitude. Distance between AZs depends on the region and the specific cloud provider[^2]. 
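To make the user-facing side of this concrete, the sketch below shows how a tenant would list the Availability Zones a cloud exposes and target one at boot time. The AZ, flavor, image, and network names are placeholder assumptions, not values taken from this guide.

```shell
# List the Availability Zones this cloud exposes (compute AZs shown here)
openstack availability zone list --compute

# Boot an instance into a specific AZ; all names below are placeholders
openstack server create \
  --availability-zone az1 \
  --flavor m1.small \
  --image ubuntu-22.04 \
  --network tenant-net \
  web01
```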
@@ -50,12 +50,15 @@ The addition of this specific metadata to an aggregate makes the aggregate visib - By default a host is part of a default Availability Zone even if it doesn’t belong to a host aggregate. The name of this default availability zone can be configured using the `default_availability_zone` config option. -(See [Host Aggregates](openstack-cloud-design-ha.md) for more information.) +!!! Note + See [Host Aggregates](openstack-cloud-design-ha.md) for more information. ### Availability Zones and Placement In order for [Placement](https://docs.openstack.org/placement/latest/){:target="_blank"} to honor Availability Zone requests, there must be placement aggregates that match the membership and UUID of Nova Host Aggregates that you assign as availability zones. An aggregate metadata key is used to control this function. As of OpenStack 28.0.0 (Bobcat), this is the only way to schedule instances to availability zones. +Administrators can configure a default availability zone where instances will be placed when the user does not specify one. For more information on how to do this, refer to [Availability Zones](https://docs.openstack.org/nova/latest/reference/glossary.html#term-Availability-Zone){:target="_blank"}. + ## Availability Zones in Cinder OpenStack block storage – [Cinder](https://docs.openstack.org/cinder/latest/){:target="_blank"} – also supports the concept of Availability Zones (AZs). Creating availability zones in Cinder is accomplished by setting a configuration parameter in `cinder.conf`, on the nodes where the `cinder-volume` service runs. @@ -107,7 +110,10 @@ By deploying HA nodes across different availability zones, it is guaranteed that ### Neutron Availability Zones and OVN -When using [Open Virtual Networking (OVN)](https://www.ovn.org/en/){:target="_blank"}[^6] aditional [special configuration](https://docs.openstack.org/neutron/latest/admin/ovn/availability_zones.html){:target="_blank"} is necessary to enable Availability Zones. +!!! Genestack + [Open Virtual Networking (OVN)](https://www.ovn.org/en/){:target="_blank"} is the networking fabric being used in [Genestack](infrastructure-ovn-setup.md). + +Additional [special configuration](https://docs.openstack.org/neutron/latest/admin/ovn/availability_zones.html){:target="_blank"} is necessary to enable Availability Zones when using OVN. ## Sharing Keystone Across Availability Zones @@ -122,4 +128,3 @@ As with Keystone, Glance can also be shared across Availability Zones. The [Regi [^3]: This is typical for hybrid cloud deployments where there is a need to have customer data centers separate from cloud provider data centers, but still a need for high-speed connectivity, such as for fail-over, disaster recovery, or data access. [^4]: The OpenStack [Placement](https://docs.openstack.org/placement/latest/){:target="_blank"} service and the scheduler work to prevent this from happening, but some actions can still attempt to do this, yielding unpredictable results. [^5]: For example: [NetApp](https://www.netapp.com/hybrid-cloud/openstack-private-cloud/){:target="_blank"} or [Pure Storage](https://www.purestorage.com/content/dam/pdf/en/solution-briefs/sb-enhancing-openstack-deployments-with-pure.pdf){:target="_blank"}. -[^6]: [Genestack](infrastructure-ovn-setup.md) is built using OVN. 
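As a rough sketch of how the AZ mechanics described above fit together, the commands below create a Nova host aggregate exposed as an Availability Zone, set the Cinder storage AZ on a `cinder-volume` node, and tag an OVN chassis with its AZ along the lines of the special configuration linked above. The host names, AZ names, and file paths are illustrative assumptions, not Genestack defaults.

```shell
# Nova: create a host aggregate and expose it as an Availability Zone
openstack aggregate create --zone az1 agg-az1
openstack aggregate add host agg-az1 compute01.example.com   # placeholder host

# Cinder: pin the backends served by this cinder-volume node to an AZ
# (appended to cinder.conf on that node; the path is an assumption)
cat >> /etc/cinder/cinder.conf <<'EOF'
[DEFAULT]
storage_availability_zone = az1
EOF

# Neutron + OVN: advertise the AZ(s) a gateway chassis belongs to,
# following the OVN availability-zones guide referenced above
ovs-vsctl set Open_vSwitch . \
  external-ids:ovn-cms-options="enable-chassis-as-gw,availability-zones=az1"
```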
diff --git a/docs/openstack-cloud-design-dr.md b/docs/openstack-cloud-design-dr.md index e8117574..db4dc534 100644 --- a/docs/openstack-cloud-design-dr.md +++ b/docs/openstack-cloud-design-dr.md @@ -1,4 +1,133 @@ # Disaster Recovery for OpenStack Clouds -!!! Info To Do - Cloud design considerations for Disaster Recovery. +## Introduction + +When designing and deploying clouds using OpenStack, Disaster Recovery (DR) needs to be at the forefront of your mind. DR needs to be a part of the design and architecture from the start. Disasters can strike in various forms, ranging from the failure of a single node to a complete site outage. While built-in redundancy measures are essential for maintaining the resilience of production-scale OpenStack environments, the effectiveness of the recovery process largely depends on careful planning and a well-defined approach. + +## Understanding Disaster Scenarios and Potential Risks + +OpenStack environments are susceptible to a wide array of failure scenarios, each presenting unique challenges and potential risks to the cloud's stability and performance. Depending on the level where the failure occurs, it may be easily mitigated by redundancy and/or recovery processes, or it may lead to a more serious outage that requires unplanned maintenance. + +By gaining a thorough understanding of these scenarios, you can better prepare for and mitigate the impact of such failures on your OpenStack clouds. The layers where the most common disaster scenarios occur include: + +### Service Failures + +Service failures occur when a particular OpenStack service (or supporting service[^1]) becomes unavailable. These are often attributed to software issues, operating system bugs, or failed OpenStack upgrades. These failures can affect critical cloud services such as Cinder, Nova, Neutron, or Keystone. The impact on instances varies depending on the affected service, with potential consequences ranging from deployment failures to service interruptions. + +### Controller Node Failures + +Hardware failures can lead to the complete outage of a Controller Node, whether virtual or physical. While Controller Node failures may not directly impact running instances and their data plane traffic, they can disrupt administrative tasks performed through the OpenStack APIs. Additionally, the loss of the database hosted on the failed controller can result in the permanent loss of instance or service information. + +!!! Note + This is a different scenario than a Control Plane Failure. The impact of the failure of a Controller Node will usually be mitigated through Controller Node redundancy in the control plane. As long as there is no data corruption, service should be uninterrupted during recovery. + +### Compute Node Failures + +Compute Node failures are the most prevalent issue in OpenStack clouds – mostly because Compute Nodes make up the majority of the nodes in the cloud. Compute Node failures are often caused by hardware failures, whether of disk, RAM, or other components. The primary risk associated with compute node failures is the potential loss of instances and their disk data if they are using local storage. + +!!! Info + This risk is not unique to OpenStack. Any cloud (or any compute environment at all) where storage is co-located with compute[^2] has this risk. + +### Network Failures + +Network failures can stem from various sources, including faulty SFP connectors, cables, NIC issues, or switch failures. 
These failures can impact both the data and control planes. Data plane NIC failures directly affect the instances using those NICs, while control-plane network failures can disrupt pending tasks such as reboots, migrations, and evacuations. + +The easiest way to account for this is to build redundancy at every level of your network: + +- **Redundant NICs** for each host → switch connectivity +- **Redundant Connections (e.g. LACP)** for each host → switch connectivity +- **Redundant Top-of-Rack (ToR) or Leaf Switches** for host → switch connectivity +- **Redundant Aggregation or Spine Switches** for switch → switch connectivity + +Having this level of redundancy won't eliminate failures, but it can massively limit or even eliminate service outage at a given level of your network, at least until maintenance can replace the affected hardware. + +### Instance Failures + +OpenStack instances, whether standalone or part of an application node, are prone to failures caused by human errors, host disk failures, power outages, and other issues. Instance failures can result in data loss, instance downtime, and instance deletion, often requiring redeployment of the affected instance or even the entire application stack that instance is part of. + +By recognizing and preparing for these potential disaster scenarios, organizations can develop comprehensive disaster recovery strategies that minimize the impact of such events on their OpenStack environments, ensuring greater system resilience and minimizing downtime. + +## Ensuring Controller Redundancy in OpenStack + +One of the fundamental design considerations in OpenStack is the implementation of a cluster with multiple controllers. A minimum of three controllers is typically deployed to maintain quorum and ensure system consistency in the event of a single server failure. By distributing services across multiple controllers, organizations can enhance the resilience and fault tolerance of their OpenStack environment. + +### Controller Deployment Strategies + +There are several standard practices for managing controller redundancy in OpenStack: + +- **Bare Metal with Containerized Services:** In this approach, each service is hosted in a container on separate bare metal servers. For example, the nova-scheduler service might run on one server, while the keystone service runs on another. This strategy provides isolation between services, potentially enhancing security and simplifying troubleshooting. This is the approach that [OpenStack-Ansible](https://docs.openstack.org/openstack-ansible/latest/){:target="_blank"} and [Kolla-Ansible](https://docs.openstack.org/kolla-ansible/latest/){:target="_blank"} take. + +- **Replicated Control Plane Services:** All control plane services are hosted together on each of the three or more servers. This replication of services across multiple servers simplifies deployment and management, as each server can be treated as a self-contained unit. In the event of a server failure, the remaining servers in the cluster continue to provide the necessary services, ensuring minimal disruption. This _hyperconverged_ approach is good for smaller OpenStack deployments, but starts to become problematic as the cluster scales. This is another approach that [OpenStack-Ansible](https://docs.openstack.org/openstack-ansible/latest/){:target="_blank"} takes, and one that Rackspace is currently using for on-prem OpenStack deployments. 
+ +- **Kubernetes-Managed Containerized Workloads:** Kubernetes can be used to manage the OpenStack control plane services as containerized workloads. This approach enables easier scaling of individual services based on demand while offering self-healing mechanisms to automatically recover from failures. _This is the approach taken by [Genestack](deployment-guide-welcome.md)._ + +### Load Balancing and High Availability + +To ensure high availability and distribute traffic among controller nodes, load balancers such as HAProxy or NGINX are commonly used for most OpenStack web services. In addition, tools like Pacemaker can be employed to provide powerful features such as service high availability, migrating services in case of failures, and ensuring redundancy for the control plane services. + +Most deployment tooling currently achieves this via two patterns: + +- **Having multiple Controller Nodes:** This is what OpenStack-Ansible and Kolla-Ansible do today. They run multiple nodes, fail over services individually when there is a problem, and balance the workload of some services across nodes. Usually, this is implemented using a software load balancer like [HAProxy](https://www.haproxy.org/){:target="_blank"} or with hardware load balancers. + +- **Using a microservices-based approach:** _This is the approach that Genestack takes by deploying the OpenStack services inside of Kubernetes._ Kubernetes provides autoscaling, load balancing, and failover capabilities for all of the OpenStack services. + +### Database and Message Queue Redundancy + +Special consideration must be given to mission-critical services like databases and message queues to ensure high availability. + +[MariaDB](https://mariadb.org/){:target="_blank"} or [MySQL](https://dev.mysql.com/community/){:target="_blank"} are the standard databases for OpenStack. To have redundancy, [Galera](https://galeracluster.com/){:target="_blank"} clustering can be used to provide multi-read/write capabilities. Regular database backups should be maintained and transferred to multiple locations outside the cluster for emergency cases. + +Message queues should be configured in a distributed mode for redundancy. The most common message queue used for OpenStack is [RabbitMQ](https://www.rabbitmq.com/){:target="_blank"}, which has [various](https://www.rabbitmq.com/docs/reliability){:target="_blank"} capabilities that can be implemented to provide reliability and redundancy. + +In some cases, with larger deployments, services might have to be separated and deployed on their own dedicated infrastructure. The OpenStack [Large Scale SIG](https://docs.openstack.org/large-scale/index.html){:target="_blank"} provides documentation on various scaling techniques for the various OpenStack services, as well as guidance around when it is appropriate to isolate a service or to scale it independently of other services. + +### Controller Redeployment and Backup Strategies + +To facilitate rapid redeployment of controllers in the event of a disaster, organizations should maintain backups of the base images used to deploy the controllers. These images should include the necessary packages and libraries for the basic functionality of the controllers. + +The backup strategy for Controller Nodes should consist of periodic snapshots of the controllers. These should be taken and transferred to safe locations to enable quick recovery without losing critical information or spending excessive time restoring backups. + +!!! 
Warning + You must back up _all_ controller nodes at the same time. Having "state skew" between the controllers has the potential to render the entire OpenStack deployment inoperable. Additionally, if you need to restore the control plane from a backup, it has the potential to differ from what is _currently running_ in terms of instances, networks, storage allocation, etc. + +Implementing robust controller redundancy strategies can enable you to significantly enhance the resilience and fault tolerance of your OpenStack deployment, minimizing the impact of controller failures and ensuring the smooth operation of your cloud infrastructure. + +## Achieving Compute Node Redundancy in OpenStack + +Compute nodes are the workhorses of an OpenStack environment, hosting the instances that run various applications and services. To ensure the resilience and availability of these instances, it is crucial to design compute node redundancy strategies that can effectively handle failures and minimize downtime. + +Implementing a well-designed Compute Node redundancy strategy will enable you to significantly enhance the resilience and availability of OpenStack instances, minimize user downtime, and ensure the smooth operation of the cloud-based applications and services your users deploy. + +### Capacity Planning and Spare Nodes + +When designing compute node redundancy, it is essential to consider the capacity of the overcloud compute nodes and the criticality of the instances running on them. + +!!! Tip + A best practice is to always maintain at least one spare compute node to accommodate the evacuation of instances from a node that has failed or that requires maintenance. This is often referred to as the _**N+1**_ strategy. + +If multiple compute node groups have different capabilities, such as CPU architectures, SR-IOV, or DPDK, the redundancy design must be more granular to address the specific requirements of each component. + +### Host Aggregates and Availability Zones + +To effectively manage compute node redundancy, subdivide your nodes into multiple [Host Aggregates (HAs)](openstack-cloud-design-ha.md) and assign one or more spare compute nodes with the same capabilities and resources to each aggregate. These spare nodes must be kept free of load to ensure they can accommodate instances from a failed compute node. When you create [Availability Zones (AZs)](openstack-cloud-design-az.md) from host aggregates, you allow users to select where their instances are deployed based on their requirements. If a Compute Node fails within an AZ, the instances can be seamlessly evacuated to the spare node(s) within the same AZ. This minimizes disruptions and maintains service continuity. + +!!! Tip + You will want to implement _**N+1**_ for all Host Aggregates and Availability Zones so that each group of compute resources has some redundancy and spare capacity. + +### Fencing Mechanisms and Instance High-Availability Policies + +For mission-critical deployed services that cannot tolerate any downtime due to compute node failures, implementing fencing mechanisms and instance high-availability (HA) policies can further mitigate the impact of such failures. + +!!! Info + OpenStack provides the [Masakari](https://docs.openstack.org/masakari/latest/){:target="_blank"} project to provide Instance High Availability. + +Defining specific High Availability (HA) policies for instances enables you to determine the actions to be taken if the underlying host goes down or the instance crashes. 
For example, for instances that cannot tolerate downtime, the applicable HA policy in Masakari is "ha-offline," which triggers the evacuation of the instance to another compute node (the spare node.) To enable this functionality, the fencing agent must be enabled in Nova. + +Masakari is a great feature, but imposes architectural requirements and limitations that can be at odds with providing a large-scale OpenStack cloud. You may want to consider limiting your cloud to have a smaller-scale "High Availability" Host Aggregate or Availability Zone to limit the architecture impact (as well as the associated costs) of providing this feature. + +### Monitoring and Automated Recovery + +Continuously monitor the health and status of compute nodes to quickly detect and respond to failures. Having automated recovery mechanisms that can trigger the evacuation of instances from a failed node to a spare node based on predefined policies and thresholds, is one way to cut down on service emergencies. This automation ensures rapid recovery and minimizes the need for manual intervention, reducing the overall impact of compute node failures on the OpenStack environment. Like with all automation, it can be a delicate balance of risk and reward, so _test everything_ and make sure the added complexity doesn't increase the administrative burden instead of cutting it down. + +[^1]: e.g. MySQL/MariaDB, RabbitMQ, etc. +[^2]: There are various hyperconverged architectures that attempt to mitigate this, however co-locating storage with compute via hyperconvergence means that failure of a Compute Node _also_ is failure of a Storage Node, so now you are dealing with _multiple_ failure types. diff --git a/docs/openstack-cloud-design-ha.md b/docs/openstack-cloud-design-ha.md index 9eb0379e..1d4bf41c 100644 --- a/docs/openstack-cloud-design-ha.md +++ b/docs/openstack-cloud-design-ha.md @@ -1,16 +1,49 @@ -## Host Aggregates +# Host Aggregates -Host Aggregates are a way of grouping hosts in an OpenStack cloud. This allows you to create groups of certain types of hosts and then steer certain classes of VM instances to them. +[Host Aggregates](https://docs.openstack.org/nova/latest/admin/aggregates.html){:target="_blank"} are a way of grouping hosts in an OpenStack cloud. This allows you to create groups of certain types of hosts and then steer certain classes of VM instances to them. +Host Aggregates[^1] are a mechanism for partitioning hosts in an OpenStack cloud, or a [Region](openstack-cloud-design-regions.md) of an OpenStack cloud, based on arbitrary characteristics. Examples where an administrator may want to do this include where a group of hosts have additional hardware or performance characteristics. -## Designing Host Aggregates in OpenStack +![Host Aggregates](assets/images/Host-Aggregates.png) -!!! info "To Do" +Each node can have multiple aggregates, each aggregate can have multiple key-value pairs, and the same key-value pair can be assigned to multiple aggregates. This information can be used in the scheduler to enable advanced scheduling or to define logical groups for migration. In general, Host Aggregates can be thought of as a way to segregate compute resources _behind the scenes_ to control and influence where VM instances will be placed. - Describe how to implement Host Aggregates using the following OpenStack Services: +## Host Aggregates in Nova - - Nova - - Placement - - Glance +Host aggregates are not explicitly exposed to users. Instead administrators map flavors to host aggregates. 
Administrators do this by setting metadata on a host aggregate, and setting matching flavor extra specifications. The scheduler then endeavors to match user requests for instances of the given flavor to a host aggregate with the same key-value pair in its metadata. Hosts can belong to multiple Host Aggregates, depending on the attributes being used to define the aggregate. -... +A common use case for host aggregates is when you want to support scheduling instances to a subset of compute hosts because they have a specific capability. For example, you may want to allow users to request compute hosts that have NVMe drives if they need access to faster disk I/O, or access to compute hosts that have GPU cards to take advantage of GPU-accelerated code. Examples include: + +- Hosts with GPU compute resources +- Hosts with different local storage capabilities (e.g. SSD vs NVMe) +- Different CPU manufacturers (e.g. AMD vs Intel) +- Different CPU microarchitectures (e.g. Skylake vs Raptor Cove) + +![Multiple Host Aggregates](assets/images/Multiple-Host-Aggregates.png) + +### Host Aggregates vs. Availability Zones + +While Host Aggregates themselves are hidden from OpenStack cloud users, cloud administrators can optionally expose a host aggregate as an [Availability Zone](openstack-cloud-design-az.md). Availability zones differ from host aggregates in that they are explicitly exposed to the user, and host membership is exclusive -- hosts can only be in a single availability zone. + +!!! Warning + Instances are not allowed to move between Availability Zones. If adding a host to an aggregate or removing a host from an aggregate would cause an instance to move between Availability Zones (including moving from or moving to the default AZ), then the operation will fail. + +### Host Aggregates in Genestack + +!!! Genestack + Genestack is designed to use [Host Aggregates](openstack-host-aggregates.md) to take advantage of various compute host types. + +## Aggregates and Placement + +The [Placement](https://docs.openstack.org/placement/latest/){:target="_blank"} service also has a concept of [Aggregates](https://specs.openstack.org/openstack/nova-specs/specs/rocky/implemented/alloc-candidates-member-of.html). However, these are not the same thing as Host Aggregates in Nova. Placement Aggregates are defined purely as groupings of related resource providers. As compute nodes in Nova are represented in Placement as resource providers, they can be added to a Placement Aggregate as well. + +## Host Aggregates and Glance + +The primary way that Glance can influence placement and work with Host Aggregates is via [Metadata](https://docs.openstack.org/glance/latest/user/metadefs-concepts.html){:target="_blank"}. + +You can map flavors and images to Host Aggregates by setting metadata on the Host Aggregate, and then set Glance image metadata properties to correlate to the host aggregate metadata. Placement can then use this metadata to schedule instances when the required filters are enabled. + +!!! Note + Metadata that you specify in a Host Aggregate limits the use of the hosts in that aggregate to instances that have the same metadata specified in their flavor or image. + +[^1]: Host aggregates started out as a way to use Xen hypervisor resource pools, but have since been generalized to provide a mechanism to allow administrators to assign key-value pairs to groups of machines. 
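A minimal sketch of the flavor-to-aggregate mapping described in this file, assuming the `AggregateInstanceExtraSpecsFilter` and `AggregateImagePropertiesIsolation` scheduler filters are enabled in Nova; the aggregate, host, flavor, and image names are placeholders.

```shell
# Group NVMe-backed hosts into an aggregate and tag it with metadata
openstack aggregate create nvme-hosts
openstack aggregate set --property storage=nvme nvme-hosts
openstack aggregate add host nvme-hosts compute01.example.com   # placeholder host

# Steer a flavor to that aggregate with a matching extra spec
# (matched by AggregateInstanceExtraSpecsFilter)
openstack flavor set --property aggregate_instance_extra_specs:storage=nvme m1.nvme

# Optionally correlate an image property with the same aggregate metadata
# (matched by AggregateImagePropertiesIsolation)
openstack image set --property storage=nvme ubuntu-22.04-nvme
```

With those filters enabled, instances built from the tagged flavor or image should only land on hosts in the matching aggregate.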
diff --git a/docs/openstack-cloud-design-intro.md b/docs/openstack-cloud-design-intro.md new file mode 100644 index 00000000..3602b717 --- /dev/null +++ b/docs/openstack-cloud-design-intro.md @@ -0,0 +1,14 @@ +# Introduction + +![OpenStack](assets/images/OpenStack-Logo-Vertical.svg){align=right : style="max-width:250px"} + +## Welcome to Genestack + +[Genestack](index.md) is a new way of deploying [OpenStack](https://openstack.org){:target="\_blank"} and [Kubernetes](https://k8s.io){:target="\_blank"} together to create a new type of unified cloud infrastructure. Before diving into the Genestack details and inner-workings, it will be important to decide how you want to structure your cloud. + +![Kubernetes](assets/images/kubernetes-stacked-color.svg){align=right : style="max-width:250px"} + +This section of the documentation covers some basic OpenStack cloud design principles, and offers insight into how these can be realized when building a solution using OpenStack and Genestack. + +!!! Genestack + Watch for **Genestack** boxes like this one to see where various design decisions, technologies, or ideas are being used in Genestack! + diff --git a/docs/openstack-cloud-design-regions.md b/docs/openstack-cloud-design-regions.md index 9e2f9438..8f0cc909 100644 --- a/docs/openstack-cloud-design-regions.md +++ b/docs/openstack-cloud-design-regions.md @@ -69,8 +69,8 @@ In most cases, [Cinder](https://docs.openstack.org/cinder/latest/){:target="_bla As with Neutron, the key is designing services that can be put together with other building blocks to create the useful combinations that cloud users are looking to take advantage of. For Cinder, this usually means some kind of cross-region replication. -Currently, Cinder [replication](https://docs.openstack.org/cinder/latest/contributor/replication.html){:target="_blank"} is limited to in-region backend failure scenarios where volumes can be -saved to multiple backends. +!!! Note + Currently, Cinder [replication](https://docs.openstack.org/cinder/latest/contributor/replication.html){:target="_blank"} is limited to in-region backend failure scenarios where volumes can be saved to multiple backends. + Replicating Cinder volumes from one Region to another is more complicated in the sense that not only does the actual volume storage need to be replicated, but both regions would need to have the metadata in sync for those volumes. Ultimately, there would need to be a way to synchronize the _state_ of those volumes so that both Regions understand the local and the remote to be the _same volume_. This is much more complex. diff --git a/docs/openstack-cloud-design-overview.md b/docs/openstack-cloud-design-topology.md similarity index 94% rename from docs/openstack-cloud-design-overview.md rename to docs/openstack-cloud-design-topology.md index 7620748d..1683e8ea 100644 --- a/docs/openstack-cloud-design-overview.md +++ b/docs/openstack-cloud-design-topology.md @@ -1,4 +1,4 @@ -# Overview +# Cloud Topology When building a cloud, the design of the cloud will not only reflect the type of services you are looking to provide but also inform the way that users of your cloud work to accomplish their goals. @@ -35,7 +35,7 @@ Regions can be defined as a high-level cloud construct where there is some sort of geographical separation of services. 
In most cases, there may be some supporting services shared between them; however, the general principle is that each region should be as self-sufficient as possible. -Depending on the scale of the cloud being built, geographical separation could be completely separate datacenters for very large clouds, or it could just mean separate data halls for smaller clouds. Regardless of the scale of separation, the operational resources should be kept as independent as possible with respect to power, networking, cooling, etc. +Depending on the scale of the cloud being built, geographical separation could be completely separate data centers for very large clouds, or it could just mean separate data halls for smaller clouds. Regardless of the scale of separation, the operational resources should be kept as independent as possible with respect to power, networking, cooling, etc. In addition to the physical geographical separation, there also needs to be logical separation of most services. Most storage, compute, and execution services[^2] should be separate and have the ability to operate independently from those same services deployed in other regions. This is key to ensuring that users can depend on any region-level failure being isolated and not bleeding into other regions and harming the availability of their deployments. Any kind of fault or failure at the Region-level should be able to be mitigated by the cloud user having deployments in multiple regions to provide high-availability. diff --git a/mkdocs.yml b/mkdocs.yml index 0bcdca03..32ab9cf9 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -133,11 +133,13 @@ nav: - Architecture: genestack-architecture.md - Components: genestack-components.md - Swift Object Storage: openstack-object-storage-swift.md - - Cloud Design: - - Overview: openstack-cloud-design-overview.md - - Regions in OpenStack: openstack-cloud-design-regions.md - - Availability Zones in OpenStack: openstack-cloud-design-az.md - - Host Aggregates in OpenStack: openstack-cloud-design-ha.md + - Design Guide: + - Introduction: openstack-cloud-design-intro.md + - Cloud Design: + - Cloud Topology: openstack-cloud-design-topology.md + - Regions: openstack-cloud-design-regions.md + - Availability Zones: openstack-cloud-design-az.md + - Host Aggregates: openstack-cloud-design-ha.md - Accelerated Computing: - Overview: accelerated-computing-overview.md - Infrastructure: accelerated-computing-infrastructure.md