Commit 7953670

Merge remote-tracking branch 'origin/feature/ZENKO-1379-crrPendingAndBatchLimit' into w/8.0/feature/ZENKO-1379-crrPendingAndBatchLimit
2 parents c8b0580 + c98ffea commit 7953670

File tree

2 files changed: +241 −31 lines


README.md

Lines changed: 173 additions & 22 deletions
@@ -1,51 +1,202 @@
 # s3utils
 S3 Connector and Zenko Utilities
 
-Run the docker container as
+Run the Docker container as
 ```
-docker run --net=host -e 'ACCESS_KEY=accessKey' -e 'SECRET_KEY=secretkey' -e 'ENDPOINT=http://127.0.0.1:8000' zenko/s3utils node scriptName bucket1[,bucket2...]
+docker run --net=host -e 'ACCESS_KEY=accessKey' -e 'SECRET_KEY=secretKey' -e 'ENDPOINT=http://127.0.0.1:8000' zenko/s3utils node scriptName bucket1[,bucket2...]
 ```
 
-Optionally, the environment variable "WORKERS" may be set to specify
-how many parallel workers should run, otherwise a default of 10
-workers will be used.
+## Trigger CRR on existing objects
 
-## Trigger CRR on objects that were put before replication was enabled on the bucket
-
-1. Enable versioning and setup replication on the bucket
+1. Enable versioning and set up replication on the bucket.
 2. Run script as
 ```
 node crrExistingObjects.js testbucket1,testbucket2
 ```
 
-## Trigger CRR on *all* objects of a bucket
+### Extra environment variables
+
+Additionally, the following extra environment variables can be passed
+to the script to modify its behavior:
+
+#### TARGET_REPLICATION_STATUS
+
+Comma-separated list of replication statuses to target for CRR
+requeueing. The recognized statuses are:
+
+* **NEW**: No replication status is attached to the object. This is
+  the state of objects written without any CRR policy attached to
+  the bucket that would have triggered CRR on them.
+
+* **PENDING**: The object replication status is PENDING.
+
+* **COMPLETED**: The object replication status is COMPLETED.
+
+* **FAILED**: The object replication status is FAILED.
+
+* **REPLICA**: The object replication status is REPLICA (objects that
+  were put to a target site via CRR have this status).
+
+The default script behavior is to affect objects that have no
+replication status attached (so equivalent to
+`TARGET_REPLICATION_STATUS=NEW`).
+
+Examples:
+
+`TARGET_REPLICATION_STATUS=PENDING,COMPLETED`
+
+Requeue objects that either have a replication status of PENDING or
+COMPLETED for CRR, do not requeue the others.
+
+`TARGET_REPLICATION_STATUS=NEW,PENDING,COMPLETED,FAILED`
+
+Trigger CRR on all original source objects (not replicas) in a bucket.
+
+`TARGET_REPLICATION_STATUS=REPLICA`
+
+For disaster recovery notably, it may be useful to reprocess REPLICA
+objects to re-sync a backup bucket to the primary site.
+
+#### WORKERS
+
+Specify how many parallel workers should run to update object
+metadata. The default is 10 parallel workers.
+
+Example:
+
+`WORKERS=50`
+
+#### MAX_UPDATES
+
+Specify a maximum number of metadata updates to execute before
+stopping the script.
+
+If the script reaches this limit, it outputs a log line containing
+the KeyMarker and VersionIdMarker to pass to the next invocation (as
+environment variables `KEY_MARKER` and `VERSION_ID_MARKER`) and the
+updated bucket list without the already completed buckets. At the next
+invocation of the script, those two environment variables must be
+set and the updated bucket list passed on the command line to resume
+where the script stopped.
+
+The default is unlimited (will process the complete listing of buckets
+passed on the command line).
+
+**If the script queues too many objects and Backbeat cannot
+process them quickly enough, Kafka may drop the oldest entries**,
+and the associated objects will stay in the **PENDING** state
+permanently without being replicated. When the number of objects
+is large, it is a good idea to limit the batch size and wait
+for CRR to complete between invocations.
+
+Example:
+
+`MAX_UPDATES=10000`
+
+This limits the number of updates to 10,000 objects, which requeues
+a maximum of 10,000 objects to replicate before the script stops.
+
+#### KEY_MARKER
+
+Set to resume from where an earlier invocation stopped (see
+[MAX_UPDATES](#MAX_UPDATES)).
+
+Example:
 
-This mode includes the objects that have already been replicated or
-that have a replication status attached.
+`KEY_MARKER="some/key"`
 
-For disaster recovery notably, to re-sync a backup bucket to the
-primary site, it may be useful to reprocess all objects regardless of
-the existence of a current replication status (e.g. "REPLICA").
+#### VERSION_ID_MARKER
 
-Follow the above steps for using "crrExistingObjects" script, and
-specify an extra environment variable `-e "PROCESS_ALL=true"` to force
-the script to reset the replication status of all objects in the
-bucket to "pending", which will force a replication for all objects.
+Set to resume from where an earlier invocation stopped (see
+[MAX_UPDATES](#MAX_UPDATES)).
+
+Example:
+
+`VERSION_ID_MARKER="123456789"`
+
+
+### Example use cases
+
+#### CRR existing objects after setting a replication policy for the first time
+
+For this use case, it's not necessary to pass any extra environment
+variable, because the default behavior is to process objects without a
+replication status attached.
+
+To avoid requeuing too many entries at once, pass this value:
+
+```
+export MAX_UPDATES=10000
+```
+
+#### Re-queue objects stuck in PENDING state
+
+If Kafka has dropped replication entries, leaving objects stuck in a
+PENDING state without being replicated, pass the following extra
+environment variables to reprocess them:
+
+```
+export TARGET_REPLICATION_STATUS=PENDING
+export MAX_UPDATES=10000
+```
+
+**Warning**: This may cause replication of objects already in the
+Kafka queue to repeat. To avoid this, set the backbeat consumer
+offsets of "backbeat-replication" Kafka topic to the latest topic
+offsets before launching the script, to skip over the existing
+consumer log.
+
+#### Replicate entries that failed a previous replication
+
+If entries have permanently failed to replicate with a FAILED
+replication status and were lost in the failed CRR API, it's still
+possible to re-attempt replication later with the following
+extra environment variables:
+
+```
+export TARGET_REPLICATION_STATUS=FAILED
+export MAX_UPDATES=10000
+```
+
+#### Re-sync a primary site completely to a new DR site
+
+To re-sync objects to a new DR site (for example, when the original
+DR site is lost) force a new replication of all original objects
+with the following environment variables (after setting the proper
+replication configuration to the DR site bucket):
+
+```
+export TARGET_REPLICATION_STATUS=NEW,PENDING,COMPLETED,FAILED
+export MAX_UPDATES=10000
+```
+
+#### Re-sync a DR site back to the primary site
+
+When objects have been lost from the primary site you can re-sync
+objects from the DR site to the primary site by re-syncing the
+objects that have a REPLICA status with the following environment
+variables (after setting the proper replication configuration
+from the DR bucket to the primary bucket):
+
+```
+export TARGET_REPLICATION_STATUS=REPLICA
+export MAX_UPDATES=10000
+```
 
 # Empty a versioned bucket
 
-This script deletes all versions of objects in the bucket including delete markers,
+This script deletes all versions of objects in the bucket, including delete markers,
 and aborts any ongoing multipart uploads to prepare the bucket for deletion.
 
-**Note: This will delete data associated with the objects and it's not recoverable**
+**Note: This deletes the data associated with objects and is not recoverable**
 ```
 node cleanupBuckets.js testbucket1,testbucket2
 ```
 
 # List objects that failed replication
 
-This script can print the list of objects that failed replication to stdout by
-taking a comma-separated list of buckets. Run the command as
+This script prints the list of objects that failed replication to stdout,
+following a comma-separated list of buckets. Run the command as
 
 ````
 node listFailedObjects testbucket1,testbucket2

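The TARGET_REPLICATION_STATUS semantics documented in the README above can be sketched as a standalone function (this is an illustration, not code from the repo; `shouldRequeue` and `TARGETED` are made-up names): `NEW` matches objects with no replication status attached, while every other value matches the object's current status exactly.

```javascript
// Illustrative sketch of the documented status filter. An object is
// requeued if ANY targeted status matches its metadata.
const TARGETED = ['NEW', 'PENDING'];

function shouldRequeue(objMD, targetedStatuses) {
    return targetedStatuses.some(status => {
        if (status === 'NEW') {
            // no replicationInfo at all, or an empty status string
            return !objMD.replicationInfo ||
                objMD.replicationInfo.status === '';
        }
        return Boolean(objMD.replicationInfo) &&
            objMD.replicationInfo.status === status;
    });
}

// A freshly written object with no CRR policy applied counts as NEW:
console.log(shouldRequeue({}, TARGETED)); // true
// An already-replicated object is left alone by this filter:
console.log(shouldRequeue({ replicationInfo: { status: 'COMPLETED' } },
    TARGETED)); // false
```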
crrExistingObjects.js

Lines changed: 68 additions & 9 deletions
@@ -13,9 +13,14 @@ const ACCESS_KEY = process.env.ACCESS_KEY;
 const SECRET_KEY = process.env.SECRET_KEY;
 const ENDPOINT = process.env.ENDPOINT;
 const SITE_NAME = process.env.SITE_NAME;
-const PROCESS_ALL = process.env.PROCESS_ALL === 'true';
+let TARGET_REPLICATION_STATUS = process.env.TARGET_REPLICATION_STATUS;
 const WORKERS = (process.env.WORKERS &&
     Number.parseInt(process.env.WORKERS, 10)) || 10;
+const MAX_UPDATES = (process.env.MAX_UPDATES &&
+    Number.parseInt(process.env.MAX_UPDATES, 10));
+let KEY_MARKER = process.env.KEY_MARKER;
+let VERSION_ID_MARKER = process.env.VERSION_ID_MARKER;
+
 const LISTING_LIMIT = 1000;
 const LOG_PROGRESS_INTERVAL_MS = 10000;
 
@@ -36,11 +41,23 @@ if (!SECRET_KEY) {
     log.fatal('SECRET_KEY not defined');
     process.exit(1);
 }
-if (PROCESS_ALL) {
-    log.warn('PROCESS_ALL environment option is active: ' +
-        'ALL objects in the bucket(s) will be reprocessed for CRR!');
+if (!TARGET_REPLICATION_STATUS) {
+    TARGET_REPLICATION_STATUS = 'NEW';
 }
 
+const replicationStatusToProcess = TARGET_REPLICATION_STATUS.split(',');
+replicationStatusToProcess.forEach(state => {
+    if (!['NEW', 'PENDING', 'COMPLETED', 'FAILED', 'REPLICA'].includes(state)) {
+        log.fatal('invalid TARGET_REPLICATION_STATUS environment: must be a ' +
+            'comma-separated list of replication statuses to requeue, ' +
+            'as NEW, PENDING, COMPLETED, FAILED or REPLICA.');
+        process.exit(1);
+    }
+});
+log.info('Objects with replication status ' +
+    `${replicationStatusToProcess.join(' or ')} ` +
+    'will be reset to PENDING to trigger CRR');
+
 const options = {
     accessKeyId: ACCESS_KEY,
     secretAccessKey: SECRET_KEY,
@@ -66,13 +83,24 @@ let nErrors = 0;
 let bucketInProgress = null;
 
 function _logProgress() {
-    log.info(`progress update: ${nProcessed - nSkipped} touched, ` +
+    log.info(`progress update: ${nProcessed - nSkipped} updated, ` +
         `${nSkipped} skipped, ${nErrors} errors, ` +
         `bucket in progress: ${bucketInProgress || '(none)'}`);
 }
 
 const logProgressInterval = setInterval(_logProgress, LOG_PROGRESS_INTERVAL_MS);
 
+function _objectShouldBeUpdated(objMD) {
+    return replicationStatusToProcess.some(filter => {
+        if (filter === 'NEW') {
+            return (!objMD.replicationInfo ||
+                objMD.replicationInfo.status === '');
+        }
+        return (objMD.replicationInfo &&
+            objMD.replicationInfo.status === filter);
+    });
+}
+
 function _markObjectPending(bucket, key, versionId, storageClass,
     repConfig, cb) {
     let objMD;
@@ -84,12 +112,9 @@ function _markObjectPending(bucket, key, versionId, storageClass,
             Key: key,
             VersionId: versionId,
         }, next),
-        // update replication info and put back object blob
         (mdRes, next) => {
             objMD = JSON.parse(mdRes.Body);
-            if (!PROCESS_ALL &&
-                objMD.replicationInfo && objMD.replicationInfo.status !== '') {
-                // skip object since it's already marked for crr
+            if (!_objectShouldBeUpdated(objMD)) {
                 skip = true;
                 return next();
             }
@@ -125,6 +150,7 @@ function _markObjectPending(bucket, key, versionId, storageClass,
                 return next();
             });
         },
+        // update replication info and put back object blob
         next => {
             if (skip) {
                 return next();
@@ -215,6 +241,15 @@ function triggerCRROnBucket(bucketName, cb) {
     let KeyMarker = null;
     bucketInProgress = bucket;
     log.info(`starting task for bucket: ${bucket}`);
+    if (KEY_MARKER || VERSION_ID_MARKER) {
+        // resume from where we left off in previous script launch
+        KeyMarker = KEY_MARKER;
+        VersionIdMarker = VERSION_ID_MARKER;
+        KEY_MARKER = undefined;
+        VERSION_ID_MARKER = undefined;
+        log.info(`resuming at: KeyMarker=${KeyMarker} ` +
+            `VersionIdMarker=${VersionIdMarker}`);
+    }
     doWhilst(
         done => _listObjectVersions(bucket, VersionIdMarker, KeyMarker,
             (err, data) => {
@@ -227,6 +262,30 @@ function triggerCRROnBucket(bucketName, cb) {
                 return _markPending(bucket, data.Versions, done);
             }),
         () => {
+            if (nProcessed - nSkipped >= MAX_UPDATES) {
+                _logProgress();
+                let remainingBuckets;
+                if (VersionIdMarker || KeyMarker) {
+                    // next bucket to process is still the current one
+                    remainingBuckets = BUCKETS.slice(
+                        BUCKETS.findIndex(bucket => bucket === bucketName));
+                } else {
+                    // next bucket to process is the next in bucket list
+                    remainingBuckets = BUCKETS.slice(
+                        BUCKETS.findIndex(bucket => bucket === bucketName) + 1);
+                }
+                let message =
+                    'reached update count limit, resuming from this ' +
+                    'point can be achieved by re-running the script with ' +
+                    `the bucket list "${remainingBuckets.join(',')}"`;
+                if (VersionIdMarker || KeyMarker) {
+                    message += ' and the following environment variables set: ' +
+                        `KEY_MARKER=${KeyMarker} ` +
+                        `VERSION_ID_MARKER=${VersionIdMarker}`;
+                }
+                log.info(message);
+                process.exit(0);
+            }
             if (VersionIdMarker || KeyMarker) {
                 return true;
             }
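The stop path above decides which buckets the next invocation must receive: if a listing marker is pending, the current bucket is unfinished and stays in the list; otherwise the resumed run starts at the next bucket. A standalone sketch of that bookkeeping (`remainingBuckets` is an illustrative name, not a function from the script):

```javascript
// Illustrative sketch: compute the bucket list to pass to a resumed run
// once the update limit is reached.
function remainingBuckets(buckets, currentBucket, hasMarker) {
    const idx = buckets.findIndex(b => b === currentBucket);
    // marker pending => current bucket is not finished, keep it;
    // otherwise resume from the bucket after it
    return hasMarker ? buckets.slice(idx) : buckets.slice(idx + 1);
}

const buckets = ['b1', 'b2', 'b3'];
console.log(remainingBuckets(buckets, 'b2', true).join(','));  // b2,b3
console.log(remainingBuckets(buckets, 'b2', false).join(',')); // b3
```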

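A detail worth noting about the limit check in the hunk above: when `MAX_UPDATES` is unset, the `process.env.MAX_UPDATES && Number.parseInt(...)` expression evaluates to `undefined`, and in JavaScript any `>=` comparison against `undefined` is false, so the limit branch never fires and the run is unbounded. A minimal sketch of that behavior (`reachedLimit` is an illustrative name):

```javascript
// Illustrative sketch: why an unset MAX_UPDATES means "unlimited".
// A numeric comparison against undefined coerces to NaN and is
// always false, so the stop branch is never taken.
function reachedLimit(updated, limit) {
    return updated >= limit; // false whenever limit is undefined
}

console.log(reachedLimit(1000000, undefined)); // false: no limit set
console.log(reachedLimit(10000, 10000));       // true: limit reached
```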