Skip to content

Fix Race Condition in OnchainPublisher Account Manager #555

@0xjeremyfrank

Description

@0xjeremyfrank

Summary

The AccountManager in the onchain publisher has a race condition that causes publish-nodes tasks to fail with TypeError: Cannot read properties of undefined (reading 'address') when all accounts are removed from the pool during high-concurrency scenarios.

Problem

When multiple transactions fail simultaneously, the account pool can become empty, causing subsequent transaction registrations to fail.

Error Observed

TypeError: Cannot read properties of undefined (reading 'address')
    at accounts.js:41:46

Root Cause

In apps/backend/src/infrastructure/services/upload/onchainPublisher/accounts.ts:

const getNextAccount = () => {
  const account = accounts[trxCounter % accounts.length]  // undefined when accounts.length === 0
  trxCounter++
  return account
}

const removeAccount = (address: string) => {
  const index = accounts.findIndex((account) => account.address === address)
  if (index !== -1) {
    accounts.splice(index, 1)
  }

  if (accounts.length === 0) {
    return initAccounts()  // async, but not awaited by callers
  }
}

Issues:

  1. getNextAccount() returns undefined when accounts.length === 0 (division by zero edge case)
  2. removeAccount() calls initAccounts() asynchronously but doesn't block new registrations while accounts are being reloaded
  3. Multiple concurrent removeAccount() calls can race to empty the array

Trigger Conditions

  • High message throughput (RABBITMQ_PREFETCH=100)
  • High transaction concurrency (MAX_CONCURRENT_UPLOADS=40)
  • Substrate RPC returns nonce collision errors (1014: Priority is too low)
  • Failed transactions trigger removeAccount(), eventually depleting the pool

Proposed Solution

export const createAccountManager = async (api: ApiPromise) => {
  let accounts = getAccounts()
  let isReinitializing = false

  const uniqueExec = pLimit(1)
  const nonceByAccount: Record<string, number> = {}
  let trxCounter = 0

  const initAccounts = () =>
    uniqueExec(async () => {
      isReinitializing = true
      try {
        accounts = getAccounts()
        const promises = accounts.map(async (account) => {
          const nonce = await getOnChainNonce(api, account.address)
          nonceByAccount[account.address] = nonce
        })
        await Promise.all(promises)
      } finally {
        isReinitializing = false
      }
    })

  const getNextAccount = () => {
    if (accounts.length === 0) {
      throw new Error('No accounts available')
    }
    const account = accounts[trxCounter % accounts.length]
    trxCounter++
    return account
  }

  const registerTransaction = () =>
    uniqueExec(async () => {
      // Wait for reinitialization if in progress
      if (accounts.length === 0 || isReinitializing) {
        await initAccounts()
      }
      
      const account = getNextAccount()
      const nonce = nonceByAccount[account.address]
      nonceByAccount[account.address] = nonce + 1
      return { account, nonce }
    })

  const removeAccount = async (address: string) => {
    return uniqueExec(async () => {
      const index = accounts.findIndex((account) => account.address === address)
      if (index !== -1) {
        accounts.splice(index, 1)
      }

      if (accounts.length === 0) {
        await initAccounts()
      }
    })
  }

  await initAccounts()

  return { registerTransaction, removeAccount }
}

Key Changes

  1. Guard against empty array: getNextAccount() throws explicitly when no accounts available
  2. Track reinitialization state: isReinitializing flag prevents new transactions during account reload
  3. Make removeAccount async: Ensure callers can await the reinitialization
  4. Serialize all operations: Use uniqueExec consistently to prevent race conditions

Alternative Considerations

  1. Circuit breaker pattern: Stop accepting new tasks when error rate exceeds threshold
  2. Retry with backoff: Add exponential backoff before reinitializing accounts
  3. Separate nonce tracking: Query fresh nonce from chain instead of local tracking (slower but more reliable)
  4. Account balance monitoring: Alert when any publishing account balance falls below threshold
  5. Never fully deplete pool: Keep at least 1 account active even on failures, or pause processing

Testing

  • Unit test: registerTransaction when accounts array is empty
  • Unit test: concurrent removeAccount calls
  • Unit test: registerTransaction during initAccounts execution
  • Integration test: high-concurrency publish-nodes with simulated RPC failures

Metadata

Metadata

Labels

No labels
No labels

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions