First Chaos Day!
First Chaos day 🎉
High CPU load on Standalone Gateway
Correlate Message after failover
Gateway Network Partition
Extract K8 resources from namespace
Experiment with Timers and Huge Variables
Big Multi Instance
Experiment without Exporters
Experiment with Low Load
Experiment with Camunda Cloud
In order to make our chaos experiments more realistic, we have set up a new GKE cluster, which is similar to the Camunda Cloud GKE cluster.
Play around with ToxiProxy
First chaos day since my parental leave 🎉.
Multiple Leader Changes
Today I wanted to add a new chaostoolkit experiment, which we can automate.
Non-graceful Shutdown Broker
Today I did not have much time for the chaos day, because of writing a Gameday summary and an incident review, taking part in incidents, etc. So enough chaos for one day :)
Gateway memory consumption
In the last weeks I checked multiple benchmarks and clusters during incidents. Often I had the feeling that the memory consumption of the gateway is not ideal.
Investigate failing Chaos Tests
Today, as part of the Chaos Day, I wanted to investigate why our current Chaos Tests are failing and why our target cluster has been broken by them.
Many Job Timeouts
In the last game day (on Friday, 06.11.2020) I wanted to test whether we can break a partition if many messages time out at the same time. What I did was send many, many messages with a decreasing TTL, all targeting a specific point in time, such that they would all time out at the same time. I expected that if this happens, the processor will try to time out all of them at once and break because the batch is too big. Fortunately this didn't happen; the processor was able to handle it.
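To illustrate the setup (this is a minimal sketch, not the exact load generator used in the game day), the following Zeebe Java client snippet publishes many messages whose TTLs all expire at the same target instant; the message name, correlation keys, message count, and gateway address are assumptions for the example.

```java
import io.camunda.zeebe.client.ZeebeClient;
import java.time.Duration;
import java.time.Instant;

public class ExpireMessagesAtOnce {
  public static void main(String[] args) {
    try (ZeebeClient client = ZeebeClient.newClientBuilder()
        .gatewayAddress("localhost:26500") // assumed local/port-forwarded gateway
        .usePlaintext()
        .build()) {

      // All messages should time out at the same wall-clock instant,
      // so the TTL decreases with every message we publish.
      Instant target = Instant.now().plus(Duration.ofMinutes(10));

      for (int i = 0; i < 10_000; i++) {
        Duration ttl = Duration.between(Instant.now(), target);
        client.newPublishMessageCommand()
            .messageName("timeout-test")   // hypothetical message name
            .correlationKey("key-" + i)    // unique key, so nothing correlates
            .timeToLive(ttl)
            .send()
            .join();
      }
    }
  }
}
```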
Message Correlation after Failover
Today I wanted to finally implement an experiment which I had postponed for a long time, see #24.
Disconnect Leader and one Follower
Happy new year everyone 🎉
Network partitions
As you can see, I migrated the old chaos day summaries to GitHub Pages, for better readability.
Deployment Distribution
On this chaos day we wanted to experiment a bit with deployments and their distribution.
Automating Deployment Distribution Chaos Experiment
This time I wanted to automate, via the ChaosToolkit, the chaos experiment which I ran on the last chaos day. For a recap, check out the last chaos day summary.
Fault-tolerant processing of process instances
Today I wanted to add another chaos experiment to increase our automated chaos experiments collection. This time we will deploy a process model (with a timer start event), restart a node, and complete the process instance via zbctl.
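As a rough illustration of those steps (the experiment itself uses zbctl), a sketch with the Zeebe 1.x Java client could look like the following; the BPMN file name and the job type are assumptions, and the broker restart itself happens outside this code.

```java
import io.camunda.zeebe.client.ZeebeClient;
import java.time.Duration;

public class TimerStartExperiment {
  public static void main(String[] args) throws InterruptedException {
    try (ZeebeClient client = ZeebeClient.newClientBuilder()
        .gatewayAddress("localhost:26500") // assumed port-forwarded gateway
        .usePlaintext()
        .build()) {

      // Deploy a process whose timer start event creates instances on its own.
      client.newDeployCommand()
          .addResourceFromClasspath("timer-start-process.bpmn") // hypothetical file
          .send()
          .join();

      // (Here the broker node would be restarted by the chaos experiment.)

      // Complete the instances created by the timer start event by working
      // off the jobs of their service task.
      client.newWorker()
          .jobType("chaos-task") // hypothetical job type of the service task
          .handler((jobClient, job) ->
              jobClient.newCompleteCommand(job.getKey()).send().join())
          .timeout(Duration.ofSeconds(30))
          .open();

      Thread.sleep(60_000); // let the worker run for a while
    }
  }
}
```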
Camunda Cloud network partition
This time Deepthi joined me on my regular Chaos Day. 🎉
Set file immutable
This chaos day was a bit different. Actually, I wanted to experiment again with Camunda Cloud and verify that our high-load chaos experiments are now working with the newest cluster plans, see zeebe-cluster-testbench#135.
BPMN meets Chaos Engineering
On the first of April (2021) we ran our Spring Hackday at Camunda. This is an event where the developers at Camunda come together to work on projects they like or on new ideas/approaches they want to try out. This time we (Philipp and I) wanted to orchestrate our Chaos Experiments with BPMN. If you already know how we automated our chaos experiments before, you can skip the next section.
Corrupted Snapshot Experiment Investigation
A while ago we wrote an experiment which should verify that followers are not able to become leader if they have a corrupted snapshot. You can find that specific experiment here. This experiment was executed regularly against Production-M and Production-S Camunda Cloud cluster plans. With the latest changes in the upcoming 1.0 release, we changed some behavior with regard to detecting snapshot corruption on followers.
Time travel Experiment
Recently we ran a Game day where a lot of messages with a high TTL had been stored in the state. This was based on an earlier incident which we had seen in production. One suggested approach to resolve that incident was to increase the time, such that all messages are removed from the state. Because of this, and the fact that summer and winter time shifts can cause evil bugs in other systems, we wanted to find out how our system can handle time shifts. Phil joined me as participant and observer. There is a related issue which covers this topic as well, zeebe-chaos#3.
Full Disk Recovery
On this chaos day we wanted to experiment with OOD recovery and ELS connection issues. This is related to the following issues from our hypothesis backlog: zeebe-chaos#32 and zeebe-chaos#14. This time @Nico joined me.
Slow Network
On a previous Chaos Day we played around with ToxiProxy, which allows injecting failures on the network level, for example dropping packets, adding latency, etc.
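For illustration, a minimal sketch using the toxiproxy-java client (eu.rekawek.toxiproxy) could set up a proxy in front of the gateway and add a latency toxic; the proxy name, addresses, and latency value are assumptions, not the exact values used in the experiment.

```java
import eu.rekawek.toxiproxy.Proxy;
import eu.rekawek.toxiproxy.ToxiproxyClient;
import eu.rekawek.toxiproxy.model.ToxicDirection;
import java.io.IOException;

public class SlowGatewayNetwork {
  public static void main(String[] args) throws IOException {
    // Toxiproxy API server, e.g. running as a sidecar next to the gateway.
    ToxiproxyClient toxiproxy = new ToxiproxyClient("localhost", 8474);

    // Clients connect to the proxy port; the proxy forwards to the real gateway.
    Proxy gatewayProxy = toxiproxy.createProxy(
        "zeebe-gateway", "0.0.0.0:26500", "zeebe-gateway:26501");

    // Add 500 ms latency to all traffic flowing towards the clients.
    gatewayProxy.toxics()
        .latency("gateway-latency", ToxicDirection.DOWNSTREAM, 500);
  }
}
```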
Old-Clients
It has been a while since the last post; I'm happy to be back.
Recovery (Fail Over) time
In the last quarter we worked on a new "feature" which is called "building state on followers". In short, …
Throughput on big state
On this chaos day we wanted to prove the hypothesis that the throughput should not change significantly even if we have a bigger state, see zeebe-chaos#64.
Not produce duplicate Keys
Due to some incidents and critical bugs we observed in the last weeks, I wanted to spend some time to understand the issues better and to experiment with how we could detect them. One of the issues we have observed was that keys were generated more than once, so they were no longer unique (#8129). I will describe this property in more depth in the next section.
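To make the uniqueness property concrete, here is a small self-contained sketch of the kind of check one could run over a list of newly assigned record keys (illustrative input, not the actual detection tooling): every key must be handed out at most once.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class DuplicateKeyCheck {

  /**
   * Returns all keys that appear more than once in the given list of newly
   * assigned record keys. In a healthy cluster this list is empty, because a
   * partition's key generator must never hand out the same key twice.
   */
  static List<Long> findDuplicates(List<Long> assignedKeys) {
    Set<Long> seen = new HashSet<>();
    return assignedKeys.stream()
        .filter(key -> !seen.add(key)) // add() returns false if the key was seen before
        .distinct()
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    // Illustrative example: the second key was handed out twice -> property violated.
    List<Long> keys = List.of(2251799813685249L, 2251799813685250L, 2251799813685250L);
    System.out.println("Duplicate keys: " + findDuplicates(keys));
  }
}
```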
Worker count should not impact performance
On this chaos day we experimented with the worker count, since we saw recently that it might negatively affect the performance (throughput) if more workers are deployed. This is related to #7955 and #8244.
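For context, "worker count" here simply means how many job workers are opened against the same job type. A minimal sketch with the Zeebe Java client follows; the job type, worker count, handler, and gateway address are assumptions for the example, not the benchmark setup itself.

```java
import io.camunda.zeebe.client.ZeebeClient;
import io.camunda.zeebe.client.api.worker.JobWorker;
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

public class ManyWorkers {
  public static void main(String[] args) {
    try (ZeebeClient client = ZeebeClient.newClientBuilder()
        .gatewayAddress("localhost:26500") // assumed local gateway
        .usePlaintext()
        .build()) {

      // Open several workers that all poll for the same job type; the
      // experiment varies this count and observes the throughput.
      int workerCount = 10;
      List<JobWorker> workers = new ArrayList<>();
      for (int i = 0; i < workerCount; i++) {
        workers.add(
            client.newWorker()
                .jobType("benchmark-task") // hypothetical job type
                .handler((jobClient, job) ->
                    jobClient.newCompleteCommand(job.getKey()).send().join())
                .maxJobsActive(32)
                .timeout(Duration.ofSeconds(30))
                .open());
      }

      try {
        Thread.sleep(60_000); // let the workers run and measure throughput
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
      workers.forEach(JobWorker::close);
    }
  }
}
```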