ObjProt (/resourcedefinitions/PVC-2ABC1180-FAEB-4B82-B35F-7B7F1FBF6B09) not found! #433

Open
kvaps opened this issue Jan 27, 2025 · 7 comments

@kvaps

kvaps commented Jan 27, 2025

linstor controller 1.29.2.

I created a few PVCs using the snapshot mechanism in Kubernetes; after that, linstor-controller stopped booting.

I'm attaching a DB dump:

1.tgz

@ghernadi
Contributor

Unfortunately, there is no way to figure out how you ended up with a database that has missing entries. I am attaching a list of JSON objects; once I create them in my local setup, the controller seems to boot again.

This is the place where I usually remind people to create a backup before applying any changes to the database, but since we still have your 1.tgz "backup" attached in the first post, that should be fine.

Created object
{
  "apiVersion": "internal.linstor.linbit.com/v1-15-0",
  "kind": "SecObjectProtection",
  "metadata": {
    "creationTimestamp": "2025-01-27T11:43:22Z",
    "generation": 1,
    "name": "6cfea00753c6da3770751bc23d624fd58909ee98a9bba647e2eb08eab1a9d6bc",
    "resourceVersion": "24548",
    "uid": "5e1ddac4-fab4-4c3e-ba01-3e4ab9af374b"
  },
  "spec": {
    "creator_identity_name": "PUBLIC",
    "object_path": "/resourcedefinitions/PVC-2ABC1180-FAEB-4B82-B35F-7B7F1FBF6B09",
    "owner_role_name": "PUBLIC",
    "security_type_name": "PUBLIC"
  }
}
{
  "apiVersion": "internal.linstor.linbit.com/v1-15-0",
  "kind": "ResourceDefinitions",
  "metadata": {
    "creationTimestamp": "2025-01-27T11:43:14Z",
    "generation": 1,
    "name": "24edeb9e11a204afbf7fb5fc4eda4f1fc556a654644602021f6b79060e471117",
    "resourceVersion": "23845",
    "uid": "0946ade6-460d-48ea-848a-f4d3369d79f2"
  },
  "spec": {
    "layer_stack": "[\"DRBD\",\"STORAGE\"]",
    "resource_dsp_name": "pvc-c7b1550a-c903-45cf-9d4b-662febef2334",
    "resource_flags": 1,
    "resource_group_name": "SC-414F8532-9472-5CF5-9521-5012BA2ABF74",
    "resource_name": "PVC-C7B1550A-C903-45CF-9D4B-662FEBEF2334",
    "snapshot_dsp_name": "snapshot-07b6f5d9-bbd8-4db4-a7e0-eb6d2926edc2",
    "snapshot_name": "SNAPSHOT-07B6F5D9-BBD8-4DB4-A7E0-EB6D2926EDC2",
    "uuid": "d085cc23-5b80-4ee8-93d0-667f5d6afbf3"
  }
}
{
  "apiVersion": "internal.linstor.linbit.com/v1-15-0",
  "kind": "ResourceDefinitions",
  "metadata": {
    "creationTimestamp": "2025-01-27T11:43:14Z",
    "generation": 1,
    "name": "29014b7486c01f50e615b6ab438efa1f19a9cbada5077d2a53720284f2c7f429",
    "resourceVersion": "23842",
    "uid": "922b83c1-6e5d-4b8b-832b-32459fa851ef"
  },
  "spec": {
    "layer_stack": "[\"DRBD\",\"STORAGE\"]",
    "resource_dsp_name": "pvc-c7b1550a-c903-45cf-9d4b-662febef2334",
    "resource_flags": 1,
    "resource_group_name": "SC-414F8532-9472-5CF5-9521-5012BA2ABF74",
    "resource_name": "PVC-C7B1550A-C903-45CF-9D4B-662FEBEF2334",
    "snapshot_dsp_name": "snapshot-f61cda0f-8754-4e92-98e9-15075a605c4d",
    "snapshot_name": "SNAPSHOT-F61CDA0F-8754-4E92-98E9-15075A605C4D",
    "uuid": "6fae6ae1-2efe-4898-821d-c85b061b0496"
  }
}
{
  "apiVersion": "internal.linstor.linbit.com/v1-15-0",
  "kind": "ResourceDefinitions",
  "metadata": {
    "creationTimestamp": "2025-01-27T11:43:14Z",
    "generation": 1,
    "name": "7f921867f1c7692ddde77bd4399a2a7871fdc5506e076baa3ef109484d73b4ef",
    "resourceVersion": "23843",
    "uid": "b6cc4b76-925a-4808-b32a-9eb0a91fc8af"
  },
  "spec": {
    "layer_stack": "[\"DRBD\",\"STORAGE\"]",
    "resource_dsp_name": "pvc-c7b1550a-c903-45cf-9d4b-662febef2334",
    "resource_flags": 1,
    "resource_group_name": "SC-414F8532-9472-5CF5-9521-5012BA2ABF74",
    "resource_name": "PVC-C7B1550A-C903-45CF-9D4B-662FEBEF2334",
    "snapshot_dsp_name": "snapshot-971a9943-056c-46f4-ae84-448184492752",
    "snapshot_name": "SNAPSHOT-971A9943-056C-46F4-AE84-448184492752",
    "uuid": "11cf8abb-84f8-4e1a-8401-e3bb2fdef70c"
  }
}
{
  "apiVersion": "internal.linstor.linbit.com/v1-15-0",
  "kind": "ResourceDefinitions",
  "metadata": {
    "creationTimestamp": "2025-01-27T11:43:14Z",
    "generation": 1,
    "name": "98191392caf4fdb33751dd09e0bd2357c6128f4ed1be213e01db3982e2797908",
    "resourceVersion": "23846",
    "uid": "43b65333-82c1-4c14-b990-1571079240b9"
  },
  "spec": {
    "layer_stack": "[\"DRBD\",\"STORAGE\"]",
    "resource_dsp_name": "pvc-c7b1550a-c903-45cf-9d4b-662febef2334",
    "resource_flags": 1,
    "resource_group_name": "SC-414F8532-9472-5CF5-9521-5012BA2ABF74",
    "resource_name": "PVC-C7B1550A-C903-45CF-9D4B-662FEBEF2334",
    "snapshot_dsp_name": "snapshot-58d7257c-689a-4a5f-a06e-406ec682154b",
    "snapshot_name": "SNAPSHOT-58D7257C-689A-4A5F-A06E-406EC682154B",
    "uuid": "2e8a6e7a-81ee-4613-9a37-b9cab34689c7"
  }
}
{
  "apiVersion": "internal.linstor.linbit.com/v1-15-0",
  "kind": "ResourceDefinitions",
  "metadata": {
    "creationTimestamp": "2025-01-27T11:43:14Z",
    "generation": 1,
    "name": "b0b4cf509d6b1befeb682c0f755b06625c30fc634ea2201e897a829dc540fed2",
    "resourceVersion": "23844",
    "uid": "e46ff53d-4902-4ce6-a130-dae4ae0a65fc"
  },
  "spec": {
    "layer_stack": "[\"DRBD\",\"STORAGE\"]",
    "resource_dsp_name": "pvc-c7b1550a-c903-45cf-9d4b-662febef2334",
    "resource_flags": 1,
    "resource_group_name": "SC-414F8532-9472-5CF5-9521-5012BA2ABF74",
    "resource_name": "PVC-C7B1550A-C903-45CF-9D4B-662FEBEF2334",
    "snapshot_dsp_name": "snapshot-feb02af6-7b64-46cd-9bce-f0f806bece09",
    "snapshot_name": "SNAPSHOT-FEB02AF6-7B64-46CD-9BCE-F0F806BECE09",
    "uuid": "3627c859-0430-4bdc-b38b-120570d83f87"
  }
}
{
  "apiVersion": "internal.linstor.linbit.com/v1-15-0",
  "kind": "ResourceDefinitions",
  "metadata": {
    "creationTimestamp": "2025-01-27T11:43:14Z",
    "generation": 1,
    "name": "c7263fcf19f10fbc5a9ffa37864110dd72b11c6b2e757a714eae2f1626a2eada",
    "resourceVersion": "23841",
    "uid": "b499006e-d285-42cb-863d-a2d884856959"
  },
  "spec": {
    "layer_stack": "[\"DRBD\",\"STORAGE\"]",
    "resource_dsp_name": "pvc-c7b1550a-c903-45cf-9d4b-662febef2334",
    "resource_flags": 1,
    "resource_group_name": "SC-414F8532-9472-5CF5-9521-5012BA2ABF74",
    "resource_name": "PVC-C7B1550A-C903-45CF-9D4B-662FEBEF2334",
    "snapshot_dsp_name": "snapshot-51663312-2043-4410-8e5c-8c787e9bafc0",
    "snapshot_name": "SNAPSHOT-51663312-2043-4410-8E5C-8C787E9BAFC0",
    "uuid": "d249f6c5-32c6-49ba-babf-3098c85417b2"
  }
}
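
For anyone re-creating objects like the ones above from a dump, here is a minimal Python sketch (not part of LINSTOR) that splits a stream of concatenated JSON objects and drops the metadata fields the Kubernetes API server normally fills in itself. The helper names `split_json_objects` and `strip_server_fields` are hypothetical, and the choice of which fields to drop is an assumption; the API server typically rejects a create request that carries a `resourceVersion`.

```python
import json

# Assumption: these metadata fields are populated by the API server and
# should not be carried over when re-creating an object from a dump.
SERVER_MANAGED = ("creationTimestamp", "generation", "resourceVersion", "uid")


def split_json_objects(text):
    """Yield each top-level JSON value from a concatenated stream."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


def strip_server_fields(obj):
    """Return a shallow copy of obj without server-managed metadata."""
    cleaned = dict(obj)
    cleaned["metadata"] = {
        key: value
        for key, value in obj.get("metadata", {}).items()
        if key not in SERVER_MANAGED
    }
    return cleaned
```

The cleaned objects can be serialized again with `json.dumps` and created with whatever tooling you normally use; take a backup of the existing custom resources first.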

Let me know if this helped.

@kvaps
Author

kvaps commented Jan 27, 2025

Thank you, I have already solved this by removing the defective resources from the DB.

@kvaps
Author

kvaps commented Jan 28, 2025

@ghernadi have you thought about some kind of fsck tool for the database?
I think it might be useful, as these problems are not new for linstor-controller.

@ghernadi
Contributor

I am not sure such a tool would really help. Sure, I see the point that such a tool might be able to repair a database like yours, but if other entries had been missing, no tool could simply reconstruct the missing data...

You are correct that this is not the first case of such problems. Although I am not 100% sure, I believe such issues only occur in K8s setups. To be clear, I do not want to blame K8s and call it a day; the issue might be in LINSTOR's K8s driver, in LINSTOR's use of multiple versions of CRD schemas, in LINSTOR's rollback mechanism, or in something else within LINSTOR that is related to K8s.

The tool you are talking about would only fix the symptoms, whereas I would rather fix the root problem.
Unfortunately, such issues occur too seldom to give us enough context to understand them better and narrow down the possible causes at least a bit.

@rp-
Contributor

rp- commented Jan 28, 2025

Maybe @kvaps meant just a checker that verifies current data integrity, so you could better spot when things go wrong.
Right now you only find out once you restart the linstor-controller, which is hopefully not that often.

@kvaps
Author

kvaps commented Jan 28, 2025

I meant a tool that would help fix the database in some way, just to make it functional.
For example, it could check the database’s integrity and attempt to fix any issues if it’s broken.

If no solution is available, the tool should remove all defective resources to get it running again.
It could provide a choice:
"These resources are broken, and we couldn’t find a way to fix them, should we remove them? [nyae]"

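A report-only version of such a check could look roughly like the Python sketch below. It is not an existing LINSTOR tool: the function name `find_missing_object_protection` is hypothetical, the input is assumed to be a list of objects in the same shape as the JSON posted earlier in this thread, and the `/resourcedefinitions/<resource_name>` path scheme is inferred only from the error message in the issue title. Entries that carry a `snapshot_name` may well use a different path, which this sketch does not handle.

```python
def find_missing_object_protection(objects):
    """Return resource names that have no matching SecObjectProtection entry."""
    protected_paths = {
        obj["spec"]["object_path"]
        for obj in objects
        if obj.get("kind") == "SecObjectProtection"
    }
    missing = []
    for obj in objects:
        if obj.get("kind") != "ResourceDefinitions":
            continue
        name = obj["spec"]["resource_name"]
        # Assumed path scheme, inferred from the error message in this issue.
        path = "/resourcedefinitions/" + name
        if path not in protected_paths and name not in missing:
            missing.append(name)
    return missing
```

Because it only reports, a check like this could run periodically against a dump without touching the database; deciding whether to repair or remove an entry stays with the operator.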
> Maybe @kvaps meant just a checker that verifies current data integrity, so you could better spot when things go wrong.
> Right now you only find out once you restart the linstor-controller, which is hopefully not that often.

This might be useful for setting up alerting, but I still have no idea what to do with such alerts 🙂

> You are correct that this is not the first case of such problems. Although I am not 100% sure, I believe such issues only occur in K8s setups. To be clear, I do not want to blame K8s and call it a day; the issue might be in LINSTOR's K8s driver, in LINSTOR's use of multiple versions of CRD schemas, in LINSTOR's rollback mechanism, or in something else within LINSTOR that is related to K8s.

LINSTOR does not use Kubernetes in a very standard way. In fact, it creates and modifies many resources at once.
My guess is that the issue might be related to the built-in API server rate limiter.
Does LINSTOR handle "too many requests" errors correctly?
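
To make the question concrete, handling a "too many requests" response correctly usually means retrying the write with a delay rather than treating it as done or silently dropping it. The Python sketch below is a generic illustration and says nothing about how LINSTOR actually behaves; `TooManyRequests` and `call_with_backoff` are hypothetical names, and `call` stands for any function that performs one API request.

```python
import time


class TooManyRequests(Exception):
    """Hypothetical error standing in for an HTTP 429 response."""

    def __init__(self, retry_after=None):
        super().__init__("too many requests")
        self.retry_after = retry_after


def call_with_backoff(call, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Run call(), retrying with exponential backoff on TooManyRequests."""
    for attempt in range(max_attempts):
        try:
            return call()
        except TooManyRequests as err:
            if attempt == max_attempts - 1:
                # Give up and surface the error instead of dropping the write.
                raise
            # Prefer the server's hint if present, else back off exponentially.
            if err.retry_after is not None:
                delay = err.retry_after
            else:
                delay = base_delay * 2 ** attempt
            sleep(delay)
```

The important property for a multi-object update is the final `raise`: if the retries are exhausted, the caller learns that the write did not happen and can roll back, instead of continuing with a partially written state.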

@boedy

boedy commented Jan 28, 2025

> Unfortunately, such issues occur too seldom to give us enough context to understand them better and narrow down the possible causes at least a bit.

Unfortunately, I don’t share the view that these issues are too rare to diagnose. We’ve encountered at least five separate incidents in the past 12 months where the LINSTOR database ended up in a corrupted state. Some of these are documented in #415 and this forum post. In fact, one of our clusters is currently down because the controller fails to start due to a corrupted state that we don't know how to resolve. On previous occasions, the error logs gave me some pointers on which records to delete from the database, and I repeated the process until the controller eventually booted again.

I agree that addressing the root cause is ultimately the best approach. However, even a single resource corruption can bring down the entire system, causing major downtime. A tool to restore the database to a functional state would at least keep us operational while the root cause can be investigated in the background.

Per the Piraeus operator’s default, we use Kubernetes as the backing datastore. Having faced multiple corruptions, I’ve often wondered whether using an external PostgreSQL database would have yielded a different outcome, especially based on #338 (comment). I was told that LINSTOR builds its own basic transaction support, but how does that compare with a PostgreSQL backing datastore?
