-
Notifications
You must be signed in to change notification settings - Fork 205
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Stop Celo L1 at a specific block #2322
Conversation
Coverage from tests in coverage: 55.1% of statements across all listed packagescoverage: 68.2% of statements in consensus/istanbul coverage: 63.2% of statements in consensus/istanbul/announce coverage: 57.2% of statements in consensus/istanbul/backend coverage: 0.0% of statements in consensus/istanbul/backend/backendtest coverage: 24.3% of statements in consensus/istanbul/backend/internal/replica coverage: 66.3% of statements in consensus/istanbul/core coverage: 50.0% of statements in consensus/istanbul/db coverage: 0.0% of statements in consensus/istanbul/proxy coverage: 64.2% of statements in consensus/istanbul/uptime coverage: 52.4% of statements in consensus/istanbul/validator coverage: 79.2% of statements in consensus/istanbul/validator/random |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All in all, i would change a few things here.
First it is important to differentiate between two asks:
- Feature is node should not produce (or accept) block after L2 Block
- Feature is node should STOP after L2 block
To me, initially the feature would be (1).
As impl plant than means only 2 places to change:
- Not Produce => modify worker create block block to STOP once it reaches a certain number
- Not accept => Modify "Blockchain.Insert()" and other insert methods on blockchain object to FAIL after L2 block is reached.
What it doesn't mean is triggering the whole shtudown mechanism
core/blockchain.go
Outdated
@@ -811,6 +811,10 @@ func (bc *BlockChain) ExportN(w io.Writer, first uint64, last uint64) error { | |||
// | |||
// Note, this function assumes that the `mu` mutex is held! | |||
func (bc *BlockChain) writeHeadBlock(block *types.Block) { | |||
if bc.Config().IsL2Migration(new(big.Int).Sub(block.Number(), big.NewInt(1))) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i feel there are better places to do this check, and return an error (vs. just log Crit which i'm not sure if it does not have the side effect of killing the process)
I remeber there are 2 entry point to add block here.
- As part of sync process. Blockchain.Insert()
- As part of consensus validators use another method that bypass the first pass of Blockchain, but it is a public method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed the log.Crit here, still looking into whether there's a better place to put the check
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There seem to be many code paths that can call writeHeadBlock including
InsertPreprocessedBlock
reorg (shouldn’t be relevant for Celo as I understand it)
InsertChain
InsertChainWithoutSealVerification
insertSideChain
My reasoning behind calling bc.StopInsert() in writeHeadBlock is just that it seems to be called every time we add a new head block regardless of the code path and so feels safer than adding the check everywhere else.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added a test to check that InsertChain behaves as expected and returns an error when inserting large chains containing the migration block.
Given that StopInsert() can be called multiple times I'm not worried about calling it in a private method, and this approach avoids changes in all the functions that call writeHeadBlock
. Given I'm a bit new to the code, it feels safer to put the checks here so that it's easy to reason about when StopInsert() gets called, and so it gets called as early as possible after the last l1 block is added.
There may be something I'm missing so please let me know if you still prefer a different spot for the checks!
|
|
||
nextBlockNum := new(big.Int).Add(block.Number(), big.NewInt(1)) | ||
if bc.Config().IsL2Migration(nextBlockNum) { | ||
log.Info("The next block is the L2 migration block, stopping block insertion", "currentBlock", block.NumberU64(), "hash", block.Hash(), "nextBlock", nextBlockNum.Uint64()) | ||
bc.StopInsert() | ||
return | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need this, the check up front should be sufficient. Sopping insertion isn't really important unless there is an attempt to insert the next block.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I remove this the long test actually starts failing. I was under the impression bc.StopInsert() was essential to call based on what I saw during testing. The function flips an atomic value that is checked at the beginning of numerous functions related to block insertion, including those in HeaderChain like writeHeaders which you asked about above.
@@ -811,6 +811,14 @@ func (bc *BlockChain) ExportN(w io.Writer, first uint64, last uint64) error { | |||
// | |||
// Note, this function assumes that the `mu` mutex is held! | |||
func (bc *BlockChain) writeHeadBlock(block *types.Block) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's also core.HeaderChain.writeHeaders
where rawdb.WriteCanonicalHash
is called. I'm just wondering if this could cause any issues. For example we may sync a header later than the stop block.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My understanding is that by calling bc.StopInsert() we prevent the HeaderChain code from continuing to update the head header. writeHeaders
checks something called procInterrupt(), which wraps a call to insertStopped(), which is injected when NewHeaderChain() is called. When we call StopInsert(), it makes procInterrupt() return true in writeHeaders before it updates the head header references etc.
// this check will pass first and log an error. | ||
if bc.Config().IsL2Migration(block.Number()) { | ||
log.Error("Attempt to insert block number >= l2MigrationBlock, stopping block insertion", "block", block.NumberU64(), "hash", block.Hash()) | ||
bc.StopInsert() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should need to call StopInsert here since this check will effectively prevent any additional blocks being added.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think StopInsert() is actually preventing some threads from inserting blocks, as removing it below makes the test fail. Specifically, the writeHeaders
function you asked about above checks something called procInterrupt(), which wraps a call to insertStopped(), which is injected when NewHeaderChain() is called. When we call StopInsert(), it makes procInterrupt() return true in writeHeaders before it updates the head header references etc.
This check was in place to fix an error that shows up via command line testing when a node is restarted with the same l2-migration-block number configured after already reaching it. I'll test whether removing this call to StopInsert re-introduces that error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wasn't sure why we need changes to writeHeadBlock
at all. So I put a panic in for the IsL2Migration
migration case to see when the check is relevant. The panic was triggered by the block fetcher.
It might be more elegant to discard the blocks when the block fetcher receives than rather then preventing the write here. Suggested commit: 90dbc6c
I still don't have the full picture, so let me know if I'm missing something.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@karlb that looks pretty neat to me, and the tests seem to work, so I'd be in favour of this approach.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok I've pushed a couple of updates to karlb/stop and removed basically everything except the extra flag, the rejection in the validate function and the e2e test. Tests seem to be working, so I think that would be the way to go now. - https://github.com/celo-org/celo-blockchain/pull/2330/files
if sb.chain.Config().IsL2Migration(nextBlockNum) { | ||
sb.logger.Info("The next block is the L2 migration block, stopping announce protocol and closing istanbul backend", "currentBlock", block.NumberU64(), "hash", block.Hash(), "nextBlock", nextBlockNum) | ||
sb.StopAnnouncing() | ||
sb.Close() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is safe to call from here since Close is also called from other threads, it might be enough to add a lock inside Close so that only one thread can be executing in there at any time. Needs a bit more thought though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we even need to call Close though?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like calling Close() may be a bit overkill actually. StopAnnouncing() stops the announce protocol and it looks like Close() basically prevents it from being easily restarted with StartAnnouncing(). I removed it and the e2e test still passed so I'll remove it from the PR
if err != nil { | ||
log.Error("Error while calling engine.StopValidating", "err", err) | ||
if istanbul, ok := w.engine.(*istanbulBackend.Backend); ok { | ||
if istanbul.IsValidating() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why the added condition here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the miner.Stop() function gets called when the user (or the e2e test) actually shuts down the node, in which case istanbul.StopValidating() throws an error here since it's already been called (in the code below). Let me know if that makes sense!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense, but this is introducing a race condition, since you could still have 2 goroutines calling stop. But it might be good enough for our purposes if the time between the 2 calls to stop is sufficient.
…h command, fixes forkid issue
… logic to shut down entire node by listening to chain head events
Closed in favour of #2330 |
Closes https://github.com/celo-org/celo-blockchain-planning/issues/419
Changes:
'--l2migrationblock' flag added to geth cmd. This specifies what will be the first block of the l2 network, and the block after the last block of the l2 network. This does not affect the forkid or hardfork config.
3 checks are added to stop block production, insertion and communication
mainLoop
inminer/worker.go
is updated to check if the next block is the l2-migration-block every time it starts generating a new block. When the l2-migration-block is reached, it the worker stops itself. This halts block production.writeHeadBlock
incore/blockchain.go
is updated to check if the block it is writing is >= the l2-migation-block before it does anything else, and also whether the block it just wrote is >= l2-migration-block - 1 at the end of the function. If either of these checks passStopInsert()
is called. This prevents any more blocks from being inserted and stops all processes related to block insertion. Normally the check at the end of the function will pass first, but if a node is restarted with the same l2-migration-block configured after already reaching and stopping on l2-migration block, the check at the beginning ofwriteHeadBlock
will pass first and log an error.InsertChain
will return an error if block insertion is stopped during its execution.Commit
inconsensus/istanbulbackend.go
is updated to check if the block that was just committed is the block before the l2-migration-block right before it returns. If the check passes thenStopAnnounce()
and 'Close()` are called. This stops blocks from being shared between nodes via the istanbul announce protocol and frees up associated resources.Other changes:
Tested with e2e tests
go test -v ./e2e_test -run TestStopNetworkAtL2BlockSimple -count 1000 -timeout 1h
go test -v ./e2e_test -run TestStopNetworkAtL2Block
Tested on Alfajores:
Use the following launch.json config
{ "name": "Launch geth afaljores", "type": "go", "request": "launch", "mode": "auto", "program": "./cmd/geth", "args": [ "--alfajores", "--datadir", "<DATADIR>", "--http", "--http.api", "eth,net,web3,debug,txpool", "--syncmode", "full", "--verbosity", "3", "--light.serve", "90", "--light.maxpeers", "1000", "--maxpeers", "1100", "--l2migrationblock", "<MIGRATION_BLOCK>", ] }
Or run with cli command
./build/bin/geth --alfajores --datadir <DATADIR> --http --http.api eth,net,web3,debug,txpool --syncmode full --verbosity 3 --light.serve 90 --light.maxpeers 1000 --maxpeers 1100 --port 30304 --l2migrationblock <MIGRATION_BLOCK>
Check on syncing with
./build/bin/geth attach <DATADIR>/geth.ipc --exec "eth.syncing"
and
./build/bin/geth attach <DATADIR>/geth.ipc --exec "eth.blockNumber"