[DPE-4656] add TLS CA rotation routine (#353)
When a new TLS certificate authority (CA) certificate is issued, the
opensearch-operator should add this new CA to all its units and request
new certificates. The new certificates (including the CA certificate)
should be distributed to all OpenSearch nodes through a rolling restart,
without downtime for the cluster as a whole.

Due to limitations of the self-signed-certificates operator, it is not
possible to:
- get a notice if a CA certificate is about to expire
- request a new CA when the current one is about to expire or has expired
- request an intermediate CA and sign future certificates with it

The self-signed-certificates operator currently does not support renewing
a root / CA certificate. A new root / CA certificate is only generated
and issued if the common_name of the CA changes.

We therefore check every received certificate to see whether it includes
a new CA. If it does, we store the new CA and initiate the CA rotation
workflow on OpenSearch.

This PR implements the following workflow:
- check whether each `CertificateAvailableEvent` includes a new CA
- add the new CA to the truststore
- add a notice `tls_ca_renewing` to the unit's peer data
- initiate a restart of OpenSearch (using the locking mechanism to
coordinate cluster availability during the restart)
- after restarting, add a notice `tls_ca_renewed` to the unit's peer
data
- when the restart is done on all of the cluster nodes, request new TLS
certificates and apply them to the node
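
The workflow above can be sketched as a small per-unit state machine. This is a simplified illustration, not the charm's implementation: `UnitState`, `handle_new_ca`, `mark_restarted`, and `ready_for_new_certs` are hypothetical names, and a plain dict stands in for the unit's peer data.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class UnitState:
    """Hypothetical stand-in for one unit's truststore and peer data."""

    current_ca: str
    truststore: List[str] = field(default_factory=list)
    peer_data: Dict[str, str] = field(default_factory=dict)


def handle_new_ca(unit: UnitState, ca_in_event: str) -> bool:
    """Start the rotation if the event carries a CA the unit has not stored yet.

    Returns True when a restart should be requested.
    """
    if ca_in_event == unit.current_ca:
        return False
    unit.truststore.append(ca_in_event)  # trust old and new CA side by side
    unit.current_ca = ca_in_event
    unit.peer_data["tls_ca_renewing"] = "True"
    return True


def mark_restarted(unit: UnitState) -> None:
    """Record that this unit restarted with the new CA in its truststore."""
    unit.peer_data["tls_ca_renewed"] = "True"


def ready_for_new_certs(units: List[UnitState]) -> bool:
    """New certificates are requested only once every unit has restarted."""
    return all("tls_ca_renewed" in u.peer_data for u in units)
```

Keeping the old CA in the truststore until every node has restarted is what lets nodes with old and new certificates keep talking to each other during the rolling restart.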

While the CA is being renewed, all incoming `CertificateAvailableEvent`s
are deferred to avoid incompatibilities in communication between the
nodes.
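
A minimal sketch of the deferral idea, assuming a plain queue in place of the framework's deferred-event storage. The charm itself calls `event.defer()`; `process_events` and the string events are hypothetical and only illustrate that nothing is applied while the `tls_ca_renewing` flag is set.

```python
from collections import deque
from typing import Deque, Dict, List


def process_events(pending: Deque[str], peer_data: Dict[str, str]) -> List[str]:
    """Apply queued certificate events unless a CA rotation is in progress.

    Events that cannot be handled yet stay in the queue, which mimics
    deferring them until a later dispatch.
    """
    applied: List[str] = []
    rotation_in_progress = "tls_ca_renewing" in peer_data
    while pending and not rotation_in_progress:
        applied.append(pending.popleft())
    return applied
```

Calling `process_events` again after the flag has been removed drains the queue in the original order.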

Please also see the flow of events and actions documented here:
https://github.com/canonical/opensearch-operator/wiki/TLS-CA-rotation-flow

- There is a dependency on #367: during the rolling restart that rotates
the CA, the voting exclusion issue is very likely to show up (at least
in 3-node clusters). The integration test therefore currently runs with
only two nodes. Once the voting exclusions issue is resolved, this can
be changed back to the usual three nodes.
- Due to an upstream bug with the JDK, it is necessary to use TLS v1.2
(for more details, see opensearch-project/security#3299).
- This PR introduces a method to append configuration to the JVM options
file of OpenSearch (used to set the TLS config to v1.2).
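
As a rough standalone illustration of the append behaviour (the actual method lives in `helper_conf_setter.py` in the diff below), the following sketch appends the TLS v1.2 flag to a throwaway file. `append_option`, `demo`, the temporary `jvm.options` file, and its initial `-Xms1g` content are assumptions made for the example.

```python
import tempfile
from pathlib import Path


def append_option(options_file: Path, option: str) -> None:
    """Append one line to an existing options file."""
    if not options_file.exists():
        raise FileNotFoundError(f"{options_file} not found.")
    with options_file.open("a") as f:
        f.write("\n" + option)


def demo() -> str:
    """Append the TLS v1.2 flag to a throwaway file and return its content."""
    with tempfile.TemporaryDirectory() as tmp:
        options_file = Path(tmp) / "jvm.options"
        options_file.write_text("-Xms1g")
        append_option(options_file, "-Djdk.tls.client.protocols=TLSv1.2")
        return options_file.read_text()
```

Note that a plain append is not idempotent: calling it twice adds the option twice, so a caller may want to check for the line first.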

---------

Co-authored-by: Mehdi Bendriss <[email protected]>
Co-authored-by: Judit Novak <[email protected]>
3 people authored and skourta committed Sep 18, 2024
1 parent 9230edb commit f1a42ab
Showing 17 changed files with 1,874 additions and 119 deletions.
1 change: 1 addition & 0 deletions lib/charms/opensearch/v0/constants_charm.py
@@ -79,6 +79,7 @@
SecurityIndexInitProgress = "Initializing the security index..."
AdminUserInitProgress = "Configuring admin user..."
TLSNewCertsRequested = "Requesting new TLS certificates..."
TLSCaRotation = "Applying new CA certificate..."
HorizontalScaleUpSuggest = "Horizontal scale up advised: {} shards unassigned."
WaitingForOtherUnitServiceOps = "Waiting for other units to complete the ops on their service."
NewIndexRequested = "new index {index} requested"
34 changes: 34 additions & 0 deletions lib/charms/opensearch/v0/helper_conf_setter.py
@@ -146,6 +146,20 @@ def replace(
"""
pass

@abstractmethod
def append(
self,
config_file: str,
text_to_append: str,
) -> None:
"""Append any string to a text file.
Args:
config_file (str): Path to the source config file
text_to_append (str): The str to append to the config file
"""
pass

@staticmethod
def __clean_base_path(base_path: str):
if base_path is None:
@@ -283,6 +297,26 @@ def replace(
with open(output_file, "w") as g:
g.write(data)

@override
def append(
self,
config_file: str,
text_to_append: str,
) -> None:
"""Append any string to a text file.
Args:
config_file (str): Path to the source config file
text_to_append (str): The str to append to the config file
"""
path = f"{self.base_path}{config_file}"

if not exists(path):
raise FileNotFoundError(f"{path} not found.")

with open(path, "a") as f:
f.write("\n" + text_to_append)

def __dump(self, data: Dict[str, any], output_type: OutputType, target_file: str):
"""Write the YAML data on the corresponding "output_type" stream."""
if not data:
98 changes: 73 additions & 25 deletions lib/charms/opensearch/v0/opensearch_base_charm.py
@@ -32,13 +32,14 @@
ServiceIsStopping,
ServiceStartError,
ServiceStopped,
TLSCaRotation,
TLSNewCertsRequested,
TLSNotFullyConfigured,
TLSRelationBrokenError,
TLSRelationMissing,
WaitingToStart,
)
from charms.opensearch.v0.constants_tls import TLS_RELATION, CertType
from charms.opensearch.v0.constants_tls import CertType
from charms.opensearch.v0.helper_charm import Status, all_units, format_unit_name
from charms.opensearch.v0.helper_cluster import ClusterTopology, Node
from charms.opensearch.v0.helper_networking import get_host_ip, units_ips
@@ -188,7 +189,7 @@ def __init__(self, *args, distro: Type[OpenSearchDistribution] = None):
self.peers_data = RelationDataStore(self, PeerRelationName)
self.secrets = OpenSearchSecrets(self, PeerRelationName)
self.tls = OpenSearchTLS(
self, TLS_RELATION, self.opensearch.paths.jdk, self.opensearch.paths.certs
self, PeerRelationName, self.opensearch.paths.jdk, self.opensearch.paths.certs
)
self.status = Status(self)
self.health = OpenSearchHealth(self)
@@ -421,9 +422,6 @@ def _on_peer_relation_created(self, event: RelationCreatedEvent):
"Adding units during an upgrade is not supported. The charm may be in a broken, unrecoverable state"
)

# Store the "Admin" certificate, key and CA on the disk of the new unit
self.tls.store_admin_tls_secrets_if_applies()

def _on_peer_relation_joined(self, event: RelationJoinedEvent):
"""Event received by all units when a new node joins the cluster."""
if self.upgrade_in_progress:
@@ -433,8 +431,6 @@ def _on_peer_relation_joined(self, event: RelationJoinedEvent):

def _on_peer_relation_changed(self, event: RelationChangedEvent):
"""Handle peer relation changes."""
self.tls.store_admin_tls_secrets_if_applies()

if self.unit.is_leader() and self.opensearch.is_node_up():
health = self.health.apply()
if self._is_peer_rel_changed_deferred:
@@ -555,7 +551,7 @@ def _on_opensearch_data_storage_detaching(self, _: StorageDetachingEvent): # no
# release lock
self.node_lock.release()

def _on_update_status(self, event: UpdateStatusEvent):
def _on_update_status(self, event: UpdateStatusEvent): # noqa: C901
"""On update status event.
We want to periodically check for the following:
@@ -735,6 +731,11 @@ def _on_get_password_action(self, event: ActionEvent):
}
)

def on_tls_ca_rotation(self):
"""Called when adding new CA to the trust store."""
self.status.set(MaintenanceStatus(TLSCaRotation))
self._restart_opensearch_event.emit()

def on_tls_conf_set(
self, event: CertificateAvailableEvent, scope: Scope, cert_type: CertType, renewal: bool
):
@@ -767,12 +768,24 @@ def on_tls_conf_set(
self.tls.store_admin_tls_secrets_if_applies()

# In case of renewal of the unit transport layer cert - restart opensearch
if renewal and self.is_admin_user_configured() and self.tls.is_fully_configured():
try:
self.tls.reload_tls_certificates()
except OpenSearchHttpError:
logger.error("Could not reload TLS certificates via API, will restart.")
self._restart_opensearch_event.emit()
if renewal and self.is_admin_user_configured():
if self.tls.is_fully_configured():
try:
self.tls.reload_tls_certificates()
except OpenSearchHttpError:
logger.error("Could not reload TLS certificates via API, will restart.")
self._restart_opensearch_event.emit()
self.tls.reset_ca_rotation_state()
self.status.clear(TLSNotFullyConfigured)
# the chain.pem file should only be updated after applying the new certs
# otherwise there could be TLS verification errors after renewing the CA
self.tls.update_request_ca_bundle()
# cleaning the former CA certificate from the truststore
# must only be done AFTER all renewed certificates are available and loaded
self.tls.remove_old_ca()
else:
event.defer()
return

def on_tls_relation_broken(self, _: RelationBrokenEvent):
"""As long as all certificates are produced, we don't do anything."""
@@ -804,6 +817,18 @@ def is_every_unit_marked_as_started(self) -> bool:
except OpenSearchHttpError:
return False

def is_tls_full_configured_in_cluster(self) -> bool:
"""Check if TLS is configured in all the units of the current cluster."""
rel = self.model.get_relation(PeerRelationName)
for unit in all_units(self):
if (
rel.data[unit].get("tls_configured") != "True"
or "tls_ca_renewing" in rel.data[unit]
or "tls_ca_renewed" in rel.data[unit]
):
return False
return True

def is_admin_user_configured(self) -> bool:
"""Check if admin user configured."""
# In case the initialisation of the admin user is not finished yet
@@ -862,18 +887,17 @@ def _start_opensearch(self, event: _StartOpenSearch) -> None: # noqa: C901

self.peers_data.delete(Scope.UNIT, "started")

if not self.node_lock.acquired:
# (Attempt to acquire lock even if `event.ignore_lock`)
if event.ignore_lock:
# Only used for force upgrades
logger.debug("Starting without lock")
else:
logger.debug("Lock to start opensearch not acquired. Will retry next event")
event.defer()
return
if event.ignore_lock:
# Only used for force upgrades
logger.debug("Starting without lock")
elif not self.node_lock.acquired:
logger.debug("Lock to start opensearch not acquired. Will retry next event")
event.defer()
return

if not self._can_service_start():
self.node_lock.release()
logger.info("Could not start opensearch service. Will retry next event.")
event.defer()
return

@@ -913,8 +937,13 @@ def _start_opensearch(self, event: _StartOpenSearch) -> None: # noqa: C901
)
)
self._post_start_init(event)
except (OpenSearchHttpError, OpenSearchStartTimeoutError, OpenSearchNotFullyReadyError):
except (
OpenSearchHttpError,
OpenSearchStartTimeoutError,
OpenSearchNotFullyReadyError,
) as e:
event.defer()
logger.warning(e)
except (OpenSearchStartError, OpenSearchUserMgmtError) as e:
logger.warning(e)
self.node_lock.release()
@@ -940,7 +969,7 @@ def _post_start_init(self, event: _StartOpenSearch): # noqa: C901
try:
nodes = self._get_nodes(use_localhost=self.opensearch.is_node_up())
except OpenSearchHttpError:
logger.debug("Failed to get online nodes")
logger.info("Failed to get online nodes")
event.defer()
return

@@ -991,6 +1020,7 @@ def _post_start_init(self, event: _StartOpenSearch): # noqa: C901

# clear waiting to start status
self.status.clear(WaitingToStart)
self.status.clear(ServiceStartError)

if event.after_upgrade:
health = self.health.get(local_app_only=False, wait_for_green_first=True)
@@ -1047,6 +1077,22 @@ def _post_start_init(self, event: _StartOpenSearch): # noqa: C901
if self.opensearch_peer_cm.is_provider():
self.peer_cluster_provider.refresh_relation_data(event, can_defer=False)

# update the peer relation data for TLS CA rotation routine
self.tls.reset_ca_rotation_state()
if self.is_tls_full_configured_in_cluster():
self.status.clear(TLSCaRotation)

# request new certificates after rotating the CA
if self.peers_data.get(Scope.UNIT, "tls_ca_renewing", False) and self.peers_data.get(
Scope.UNIT, "tls_ca_renewed", False
):
self.status.set(MaintenanceStatus(TLSNotFullyConfigured))
self.tls.request_new_unit_certificates()
if self.unit.is_leader():
self.tls.request_new_admin_certificate()
else:
self.tls.store_admin_tls_secrets_if_applies()

def _stop_opensearch(self, *, restart=False) -> None:
"""Stop OpenSearch if possible."""
self.status.set(WaitingStatus(ServiceIsStopping))
@@ -1091,7 +1137,9 @@ def _restart_opensearch(self, event: _RestartOpenSearch) -> None:

try:
self._stop_opensearch(restart=True)
logger.info("Restarting OpenSearch.")
except OpenSearchStopError as e:
logger.info(f"Error while Restarting Opensearch: {e}")
logger.exception(e)
self.node_lock.release()
event.defer()
22 changes: 16 additions & 6 deletions lib/charms/opensearch/v0/opensearch_config.py
@@ -64,6 +64,11 @@ def set_client_auth(self):
True,
)

self._opensearch.config.append(
self.JVM_OPTIONS,
"-Djdk.tls.client.protocols=TLSv1.2",
)

def set_admin_tls_conf(self, secrets: Dict[str, any]):
"""Configures the admin certificate."""
self._opensearch.config.put(
@@ -89,12 +94,11 @@ def set_node_tls_conf(self, cert_type: CertType, truststore_pwd: str, keystore_p
f"{self._opensearch.paths.certs_relative}/{cert if cert == 'ca' else cert_type}.p12",
)

for store_type, certificate_type in [("keystore", cert_type.val), ("truststore", "ca")]:
self._opensearch.config.put(
self.CONFIG_YML,
f"plugins.security.ssl.{target_conf_layer}.{store_type}_alias",
certificate_type,
)
self._opensearch.config.put(
self.CONFIG_YML,
f"plugins.security.ssl.{target_conf_layer}.keystore_alias",
cert_type.val,
)

for store_type, pwd in [("keystore", keystore_pwd), ("truststore", truststore_pwd)]:
self._opensearch.config.put(
@@ -103,6 +107,12 @@ def set_node_tls_conf(self, cert_type: CertType, truststore_pwd: str, keystore_p
pwd,
)

self._opensearch.config.put(
self.CONFIG_YML,
f"plugins.security.ssl.{target_conf_layer}.enabled_protocols",
"TLSv1.2",
)

def append_transport_node(self, ip_pattern_entries: List[str], append: bool = True):
"""Set the IP address of the new unit in nodes_dn."""
if not append:
3 changes: 2 additions & 1 deletion lib/charms/opensearch/v0/opensearch_distro.py
@@ -191,7 +191,8 @@ def is_node_up(self, host: Optional[str] = None) -> bool:
timeout=1,
)
return resp_code < 400
except (OpenSearchHttpError, Exception):
except (OpenSearchHttpError, Exception) as e:
logger.debug(f"Error when checking if host {host} is up: {e}")
return False

def run_bin(self, bin_script_name: str, args: str = None, stdin: str = None) -> str:
6 changes: 6 additions & 0 deletions lib/charms/opensearch/v0/opensearch_relation_peer_cluster.py
@@ -597,6 +597,7 @@ def _set_security_conf(self, data: PeerClusterRelData) -> None:

# store the app admin TLS resources if not stored
self.charm.tls.store_new_tls_resources(CertType.APP_ADMIN, data.credentials.admin_tls)
self.charm.tls.update_request_ca_bundle()

# set user and security_index initialized flags
self.charm.peers_data.put(Scope.APP, "admin_user_initialized", True)
@@ -872,6 +873,11 @@ def _error_set_from_tls(self, peer_cluster_rel_data: PeerClusterRelData) -> bool
blocked_msg = "CA certificate mismatch between clusters."
should_sever_relation = True

if not peer_cluster_rel_data.credentials.admin_tls["truststore-password"]:
logger.info("Relation data for TLS is missing.")
blocked_msg = "CA truststore-password not available."
should_sever_relation = True

if not blocked_msg:
self._clear_errors("error_from_tls")
return False
8 changes: 1 addition & 7 deletions lib/charms/opensearch/v0/opensearch_secrets.py
@@ -114,13 +114,7 @@ def _on_secret_changed(self, event: SecretChangedEvent): # noqa: C901

logger.debug("Secret change for %s", str(label_key))

# Leader has to maintain TLS and Dashboards relation credentials
if not is_leader and label_key == CertType.APP_ADMIN.val:
self._charm.tls.store_new_tls_resources(CertType.APP_ADMIN, event.secret.get_content())
if self._charm.tls.is_fully_configured():
self._charm.peers_data.put(Scope.UNIT, "tls_configured", True)

elif is_leader and label_key == self._charm.secrets.password_key(KibanaserverUser):
if is_leader and label_key == self._charm.secrets.password_key(KibanaserverUser):
self._charm.opensearch_provider.update_dashboards_password()

# Non-leader units need to maintain local users in internal_users.yml