
Node Archive and Restore #2250

Closed
briantopping opened this issue Aug 9, 2020 · 14 comments
Labels
area/UX kind/feature Categorizes issue or PR as related to a new feature. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done.
Milestone

Comments

@briantopping

FEATURE REQUEST

Preface: There is no substitute for proper backup and restore hygiene.

This feature request is for cluster backup and restore functionality as part of kubeadm. Cluster deployment tools have unique knowledge of their own behavior and of the files that are common and unique to a cluster. In disaster recovery situations time is of the essence, and a faster automated recovery can be very valuable. While file-by-file backups of an OS root are feasible, efficiencies can be gained with cloud-init based images if the archiver can cherry-pick only the application-level files necessary for a restore. It is risky to expect users to track changes in deployment layouts over time, and quite simple if they know they can script against a tool that knows how to do so.

Implementation is envisioned in three parts:

  1. kubeadm archive generates an archive that can be consumed by the restore functionality. It would write to a file specified on the command line, to a generated filename in /tmp, or to a storage provider such as S3.
  2. kubeadm restore would return a node to the condition it was in at the time the archive was generated.
  3. kubeadm reset would be modified to create an archive by default (a hypothetical usage sketch follows this list).
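
For illustration only, usage might look something like the following. The command names come from the list above; the flags and paths are hypothetical, not a concrete design.

# hypothetical: write an archive to an explicit path (or a generated name under /tmp)
sudo kubeadm archive --output /var/backups/node-archive.tar.gz

# hypothetical: restore the node from that archive; should fail if the target directories already exist
sudo kubeadm restore --from /var/backups/node-archive.tar.gz

# reset would create an archive by default; the opt-out flag here is hypothetical
sudo kubeadm reset --skip-archive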

In all cases, archive and restore should match the current behavior of kubeadm init / kubeadm join:

  • A simple single-master cluster with no host resources would capture an etcd snapshot in the archive generated indirectly by kubeadm reset (see the snapshot sketch after this list). So long as resources set up during the installation did not change (IP addresses, CRI, etc.), kubeadm restore could return the cluster to its previous state.
  • The same would generally be true for an HA cluster - the restored node would re-peer with its previous peers since the identity contained in /etc/kubernetes would allow it. Optional / future logic might recognize that those peers are missing or damaged and allow a cluster archive to be hydrated without the peers and onto a single-node etcd (utilizing the etcd snapshot). It is a non-requirement that a modified HA cluster (i.e. kubectl delete node foo after archiving foo) would allow the archived node to re-join.
  • A restore where destination directories already exist should fail without making any changes.
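
For reference, the etcd snapshot portion can already be captured with etcdctl today. The following is only a sketch: the output path is arbitrary, and the endpoint and certificate paths assume kubeadm's stacked-etcd defaults.

# take a snapshot of the local stacked etcd member (cert paths assume kubeadm defaults)
sudo ETCDCTL_API=3 etcdctl snapshot save /var/backups/etcd-snapshot.db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key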

It is important to recognize that node-specific resources must be intact for a restore to be successful. A Local Persistent Volume is an excellent example of this, but it holds true for devices that might be attached by Rook, interface names or addresses, local hostname configuration, etc.

Use cases

  • Inadvertent kubeadm reset is reversible if archive generation is not explicitly disabled
  • Backup tools can trigger the generation of an archive to store in monolithic node backups
  • DR of a node could be reliably scripted from backups in a future-proof manner
  • Nodes could be built with cloud-init images and reliably regenerated from archives
@neolit123
Member

hi @briantopping ,

kubeadm's primary goal is to be a k8s cluster bootstrapper that creates a minimum viable cluster.
everything else we can consider out of scope and delegate to an external tool (a la the Unix philosophy)

with respect to this FR, kubeadm is already overstepping the Unix philosophy in a couple of places:

  • the command "kubeadm reset" was added to reset the changes that kubeadm made on a node in the case of mutable infrastructure.
  • "kubeadm upgrade" already does a partial backup and restore operation. it takes etcd snapshots and restores control-plane component manifests in case the control plane fails post-upgrade. this felt right because "upgrade" can be disruptive.

big changes in kubeadm require a KEP:
https://github.com/kubernetes/enhancements/tree/master/keps/sig-cluster-lifecycle/kubeadm
https://github.com/kubernetes/enhancements/tree/master/keps#kubernetes-enhancement-proposals-keps
but before you do that, i'd wait to get some +1s from the maintainers.

my initial comments here are the following:

  • is this more suitable as a feature of the kubeadm operator?

  • what falls in scope of a "node archive"?
    if someone is running a mandatory privileged static pod on their nodes that interacts with a hostPath, how would the new kubeadm command know to back up that hostPath too?

  • this can be problematic:

    A restore where destination directories already exist should fail without making any changes.

    what about /var/lib/kubelet? i'd expect a proper node backup to also back up this directory, including the kubelet certificates,
    but then does that mean that a "restore" can only happen if a kubelet is not running on the node?

@kubernetes/sig-cluster-lifecycle
/kind feature
/area ux
/priority awaiting-more-evidence

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. area/UX priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done. labels Aug 10, 2020
@briantopping
Author

briantopping commented Aug 10, 2020

Hi @neolit123, thanks for the comprehensive feedback. I agree with your positions here. Taking on inappropriate scope is a good way to break a project. Nobody wants that.

In keeping with your ideas, I believe this FR is in scope, but only because an implementation of the feature outside of kubeadm would still be version-locked to changes that occur in kubeadm over time. There is no doubt that an in-tree feature would add to the weight of a release, but users would be sure that for any given release and deployment, disaster recovery operations were dependable. This is also very UNIXy: the external tool that we depend on to know how to back up a kubeadm-generated cluster is actually the kubeadm binary present on the node.

I think this perspective can be validated by your comment about /var/lib/kubelet. The maintainers of kubeadm are best equipped to recognize these nuances over time, especially regarding structural changes that must be captured in an archive. A hypothetical out-of-tree project would have an implementation gap and would require exhaustive examination of every commit to kubeadm and/or a lot of resource-intensive queries over time.

In light of your feedback, would you agree the determination of whether "dependable archive functionality is valuable" should take place before a consideration of the packaging (operator, krew, in-tree kubeadm, etc)? Capturing this as a KEP would be a good exercise in filtering out broken assumptions, refining requirements, and gauging value. Any one of those three could fail the effort. If the effort gains momentum, we should have a better idea by that point what the packaging should look like and why.

@neolit123
Member

neolit123 commented Aug 10, 2020

I think this perspective can be validated by your comment about /var/lib/kubelet. The maintainers of kubeadm are best equipped to recognize these nuances over time, especially regarding structural changes that must be captured in an archive. A hypothetical out-of-tree project would have an implementation gap and would require exhaustive examination of every commit to kubeadm and/or a lot of resource-intensive queries over time.

an important aspect here is that directories like /var/lib/kubelet or /var/lib/etcd are not really maintained by kubeadm; they are maintained by the kubelet and etcd. while kubeadm would be closer to knowing what is in there than any external tool, it would face the same issue of making maintainers look at the source code of the kubelet/etcd to determine "what changed".

on the other hand, just always archiving their contents might not be desired.

In light of your feedback, would you agree the determination of whether "dependable archive functionality is valuable" should take place before a consideration of the packaging (operator, krew, in-tree kubeadm, etc)?

it feels like supporting detailed backup/archive would require user-level configuration - e.g. making it possible to enumerate paths and skip sub-paths. the default paths to be archived become debatable and could trigger a number of change requests from users...
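
purely as a strawman, that user-level configuration could look something like this - the flags below are hypothetical and not a proposal:

# hypothetical flags for enumerating paths and skipping sub-paths
sudo kubeadm archive \
  --include-path /etc/kubernetes \
  --include-path /var/lib/kubelet \
  --exclude-path /var/lib/kubelet/pods \
  --output /var/backups/node-archive.tar.gz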

In light of your feedback, would you agree the determination of whether "dependable archive functionality is valuable" should take place before a consideration of the packaging (operator, krew, in-tree kubeadm, etc)?

i don't think many would argue against the value of backup/restore; i'm more interested in enumerating the locations that would be backed up and restored, the level of customization users would have, and where this tooling would live.

Capturing this as a KEP would be a good exercise in filtering out broken assumptions, refining requirements, and gauging value. Any one of those three could fail the effort. If the effort gains momentum, we should have a better idea by that point what the packaging should look like and why.

there are multiple levels to KEPs. for example here is the place for generic sig-cluster-lifecycle KEPs:
https://github.com/kubernetes/enhancements/tree/master/keps/sig-cluster-lifecycle/generic

here are kubeadm specific KEPs:
https://github.com/kubernetes/enhancements/tree/master/keps/sig-cluster-lifecycle/kubeadm

but like i said, ideally this FR should get some +1 comments from the maintainers, before going into KEP form.
you could start by writing a google doc with your thoughts if you'd prefer.

@fabriziopandini
Member

I'm really torn about this feature.

From one side, I understand the user's need for something that takes charge of the whole problem.
From the other side, I don't think that kubeadm is the right tool for this job, especially considering that most of the complexity here is due to etcd management and to a bazillion knobs that go far beyond kubeadm's responsibility (e.g. additional static pods, other OS-level settings & utilities, OS/distro differences, etc.)

Other things that make me lean towards a -1 on getting this into kubeadm are:

  • backup and restore is really an opinionated space, and there will be people arguing we should not back up & restore etcd at the file level but instead act at the k8s resource level (see GitOps or tools like Velero)
  • similarly, there will be people arguing that we should not back up node settings but simply re-create the node; see the cattle vs. pets or mutable vs. immutable nodes discussions.

IMO this feature - or more generically an HA/DR plan - should be part of higher-level tools in the stack like Cluster API, Kubespray, or Kops, because those tools have control of the full stack and can clearly define the scope to account for.

@briantopping
Author

I'm okay with this not being a kubeadm feature and I don't want to waste valuable team time if this is the wrong team for it. Super grateful for the input so far and pleased it's not a solid -1 off the bat.

That said, I can't imagine how many people have said "OH FSCK!" as they acknowledged kubeadm reset in the wrong terminal window. I've done it twice now and lost a month of work to it the first time, just starting my second time now, but at least have a copy of /etc/kubernetes from the last surviving node. I don't think kubeadm reset should be removed, but it should at the very least refuse to reset a node when doing so would cause an HA cluster to become unresponsive. I don't know what to say about the single-node control plane case, but I think it's important too. Ceph has flags like "--yes-i-really-mean-it" that are initially really annoying but that one comes to respect over time.

Anyway, as I thought through the problem, I started to realize that a minimal (non-transitive) archive of the content deleted by kubeadm reset would have solved the problem tidily. But was it realistic to stop there? Shouldn't that archive be usable for more than just inadvertent resets by people who should be sleeping instead of hacking?

And that's where the thought process strayed to how the "knobs and dependencies" in kubeadm are very different from those of other deployers. Kubeadm knows whether it installed stacked vs. external etcd, a consideration that is important to getting a transactionally stable archive. Kops is going to have some AWS config. Is it good scope for the installers to create that stable archive of a minimum viable cluster as normative functionality?

I do appreciate that fully restoring a node is a transitive closure problem, and I didn't mean to imply this archive should do that, at all. It would be impossible to track CRs that create local resources, for instance. It started as above, no more, no less.

Last thing I will add since Kubespray and Kops are mentioned: there's a user base that runs complex clusters on bare metal, and kubeadm is the most reliable tool to do that with. If it feels to the team that it's an MVP reference bootstrapper, the community perspective might surprise you. It's really the best thing going and is an essential peer of Kubespray and Kops. So something that they should be doing, kubeadm should arguably also be doing. And I get that might mean the Cluster API should be facilitating that, removing that responsibility from installers instead of having each of them rebuild the same functionality. One way to do that would be to leave ConfigMap objects in scope so the Cluster API could provide this archive-like functionality across all installs.

@fabriziopandini
Member

@briantopping those are valuable comments. thanks
I will try to get the best out of them.

The first action item is to consider whether we can add an additional sanity check on top of kubeadm reset, asking for a second confirmation in case the action is potentially destructive for the cluster (resetting a control-plane node).

@neolit123
Member

neolit123 commented Aug 12, 2020

That said, I can't imagine how many people have said "OH FSCK!" as they acknowledged kubeadm reset in the wrong terminal window. I've done it twice now and lost a month of work to it the first time, just starting my second time now, but at least have a copy of /etc/kubernetes from the last surviving node.

one might as well call rm -rf ./ by mistake which is also dangerous and skips confirmation.

I don't think kubeadm reset should be removed, but it should at the very least refuse to reset a node when doing so would cause an HA cluster to become unresponsive. I don't know what to say about the single-node control plane case, but I think it's important too.

for the minimal / recommended number of CP nodes in an HA cluster - 3, removing 1 node falls under the accepted failure tolerance of etcd:
https://etcd.io/docs/v3.3.12/faq/

thus, for an HA CP it would be easier to recover from a kubeadm reset mistake. for a single CP node scenario, that would be harder; then again, that is why HA is recommended.
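
for reference, the tolerance from the etcd FAQ works out as quorum = floor(n/2) + 1:

members  quorum  failures tolerated
1        1       0
3        2       1
5        3       2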

Ceph has flags like "--yes-i-really-mean-it" that are initially really annoying but that one comes to respect over time.

kubeadm reset has a confirmation prompt, accepting yes/no from the user. -f overrides that.

[reset] Are you sure you want to proceed? [y/N]:

i don't think we should be adding yet another one.

Anyway, as I thought through the problem, I started to realize that a minimal (non-transitive) archive of the content deleted by kubeadm reset would have solved the problem tidily. But was it realistic to stop there? Shouldn't that archive be usable for more than just inadvertent resets by people who should be sleeping instead of hacking?

i wouldn't call executing the following simple tar command hacking, though:

sudo tar -cf ~/k8s-archives/somearchive.tar /etc/kubernetes /var/lib/kubelet /var/lib/etcd

it would be the equivalent of a best effort kubeadm archive command that we can design.

there are other caveats around this feature.
a proper archive requires shutting down the etcd server and kubelet instances on the node to prevent archiving in between file writes. then, if etcd is archived on a CP node, all other CP nodes must use the same etcd snapshot and not archive/restore their own.
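
as a rough sketch only - assuming stacked etcd running as a static pod and the default kubeadm directories, and glossing over the certificate caveat below - that would mean something like:

# stop the kubelet so it no longer restarts static pods
sudo systemctl stop kubelet
# static pod containers keep running under the CRI after the kubelet stops; stop them explicitly
sudo crictl ps -q | xargs -r sudo crictl stop
# archive the node state while nothing is writing to it
sudo tar -czf /var/backups/node-archive.tar.gz /etc/kubernetes /var/lib/kubelet /var/lib/etcd
# bring everything back; the kubelet recreates the static pods from /etc/kubernetes/manifests
sudo systemctl start kubelet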

another problem here is archiving at a time when the kubelet is rotating its client/serving certificates. if the archive captures an expired certificate, then on restart the kubelet will fail to authn with the api-server, resulting in a requirement for the node to re-bootstrap using new credentials (token/certs), i.e. admin intervention.

these are reasons that add to the argument that this should be a responsibility higher up the stack, while the customizable low-level archival mechanism itself could be anything (e.g. tar).

@neolit123 neolit123 added this to the v1.20 milestone Aug 12, 2020
@briantopping
Author

one might as well call rm -rf ./ by mistake which is also dangerous and skips confirmation.

Oh, is that the problem? 😂🤣👍🎉

The problem is that rm -rf ./ is not a command anyone uses. kubeadm reset is a very common command for bare metal admins. Being in the wrong window after running a half dozen kubectl commands on a working cluster to figure out which tokens are needed is a common situation. These two are toxic together. There shouldn't even be a warning if we aren't trying to protect users and their clusters. Just wipe it out already.

Sometimes I have trouble parsing whether you are serious or having fun being snarky. I'm trying to help out here. I'm not some dumbshit, and we're both too busy to waste time on why something is impossible. Others who aren't quite as comfortable admitting their failures are probably also running into the problem. I think I know you like to have fun and are way smarter than me on most of this stuff from our last interaction, but I want to be careful. Hopefully that's 'nuf said. 🙏

Your excellent list of issues just bolsters the need for this FR. Having to know all of this to own a kubeadm-based cluster is a cluster management experience that can be improved, and the list likely changes over time. A lot of k8s users are generalists and have a much wider scope than just k8s knowledge. How can we empower them to work confidently and efficiently on clusters? Crashing clusters is "not efficient" and will push people away when they might have actually been just fine without that one issue...

@neolit123
Member

...having fun being snarky.

that is certainly not my intention.
would appreciate more serious comments on this feature request by maintainers and users.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 11, 2020
@neolit123 neolit123 modified the milestones: v1.20, v1.21 Dec 2, 2020
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 1, 2021
@fabriziopandini
Member

/remove-lifecycle stale

@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
