Update Scheduler to Support Relay Chain Block Number Provider #6362
Conversation
@@ -1157,24 +1181,30 @@ impl<T: Config> Pallet<T> {
```rust
			return
		}

		let mut incomplete_since = now + One::one();
		let mut when = IncompleteSince::<T>::take().unwrap_or(now);
```
Can you explain why it would not work with IncompleteSince, without the block Queue?

How do we determine the MaxScheduledBlocks bound?

With the IncompleteSince we iterate over blocks that might have no task to execute, and this might make a situation with many incomplete blocks even worse. But probably not by too much? One more read?

Both solutions need a strategy for the situation when there are too many tasks that cannot be completed and the task queue only grows, if such a strategy is not yet in place.
> With the IncompleteSince we iterate over blocks that might have no task to execute, and this might make a situation with many incomplete blocks even worse. But probably not by too much? One more read?

Yes, but then this becomes unbounded in case too many blocks are skipped. The idea behind using the Queue is to bound this to a sufficient number.

> How do we determine the MaxScheduledBlocks bound?

This should be determined similarly to the existing MaxScheduledPerBlock?

> Both solutions need a strategy for the situation when there are too many tasks that cannot be completed and the task queue only grows.

There is already a retry mechanism and the task is purged if the retry count is exceeded (even if failed).
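To make the trade-off concrete, here is a minimal sketch of the two catch-up strategies being compared; the function names are illustrative, not the pallet's actual API:

```rust
/// IncompleteSince-style catch-up: every block since the last fully
/// serviced one is visited, even blocks with no agenda, so the work is
/// unbounded when many blocks passed between runs.
fn blocks_visited_incomplete_since(since: u32, now: u32) -> u32 {
    now.saturating_sub(since) + 1
}

/// Queue-style catch-up: only block numbers that actually have agendas
/// are visited, and the queue itself is capped at MaxScheduledBlocks.
fn blocks_visited_queue(queue: &[u32], now: u32) -> usize {
    queue.iter().take_while(|&&when| when <= now).count()
}
```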
The Queue not only bounds how many blocks are going to be processed from the past. It bounds how many blocks we can schedule for. If the number is 50, we can schedule only 50 jobs with distinct schedule times.

MaxScheduledPerBlock seems simpler to define to me, because the block size is an existing constraint the system has. But how many distinct schedule time points you can have is something new.

Retries work when a certain task fails while its function call is being executed (not when the scheduler itself fails). I meant the case when there are many (or few but too heavy) overdue tasks (task_block < now), so that the scheduler never completes them (or needs too much time) and never exits the overdue state to start processing tasks on time. Do we handle such a case?
> The Queue not only bounds how many blocks are going to be processed from the past. It bounds how many blocks we can schedule for. If the number is 50, we can schedule only 50 jobs with distinct schedule times.

Indeed, I do not find it quite comfortable to run a for loop with IncompleteSince when there could be an unknown number of blocks passed between successive runs. You could always keep MaxScheduledBlocks on the higher side, which would give you a similar experience?

> I meant the case when there are many (or few but too heavy) overdue tasks (task_block < now), so that the scheduler never completes them (or needs too much time) and never exits the overdue state to start processing tasks on time. Do we handle such a case?

But this stays an issue even in the current implementation? The change here just makes it bounded, so that the scheduling itself is blocked in such a case.
Maybe we can put quite a big bound on MaxScheduledBlocks; it is just a vec of block numbers.
Playing devil's advocate here: there could be parachains that only produce one block every two hours, which would get stuck without ever catching up the IncompleteSince.

I suggest focusing primarily on our use case, where Asset Hub sets up a scheduler with a Relay Chain block provider. If we can solve this problem with minimal code changes (or better, no code changes), I would prefer that approach. We can document our expectations for the BlockProvider type next to its declaration, and if we or someone else encounter the use case you described in the future, we can address it then.

I wouldn't complicate this pallet for theoretically possible use cases. Instead, we should target this issue for resolution in the next SDK release.
If we constrain our use case, then we can:

1. Keep it as it was, iterating on every block.
2. Go one level faster: a new storage map that maps from block_number / 256 to a bit-array of size 256, and iterate on this map. That means one iteration per 256 blocks, i.e. 25.6 minutes' worth of blocks: we only have to iterate 56 times to cover one day, about 1700 for one month. The PoV cost is only 256 bits (plus keys) for our Asset Hub use case. PoV gets bad if we read many empty values, but I guess it is OK.
3. (Implement a priority queue on top of chunks in a storage map.)

I suggest 1 or 2; it is good for almost everything IMO.
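A rough sketch of what option 2 could look like, modeled with a plain BTreeMap standing in for the runtime storage map (all names here are illustrative, not from the PR):

```rust
use std::collections::BTreeMap;

const CHUNK_SIZE: u32 = 256;

/// One bit per block, packed into 256-bit chunks keyed by `block / 256`.
struct ScheduledBlocks {
    chunks: BTreeMap<u32, [u8; 32]>, // 32 bytes = 256 bits per chunk
}

impl ScheduledBlocks {
    /// Record that `block` has a scheduled agenda.
    fn mark(&mut self, block: u32) {
        let (chunk, bit) = (block / CHUNK_SIZE, block % CHUNK_SIZE);
        let word = self.chunks.entry(chunk).or_insert([0u8; 32]);
        word[(bit / 8) as usize] |= 1 << (bit % 8);
    }

    /// Earliest scheduled block in `since..=now`, scanning one 256-block
    /// chunk at a time: one "read" covers 25.6 minutes of 6-second blocks.
    fn next_scheduled(&self, since: u32, now: u32) -> Option<u32> {
        for (&chunk, word) in self.chunks.range(since / CHUNK_SIZE..=now / CHUNK_SIZE) {
            for bit in 0..CHUNK_SIZE {
                let block = chunk * CHUNK_SIZE + bit;
                if block >= since
                    && block <= now
                    && word[(bit / 8) as usize] & (1 << (bit % 8)) != 0
                {
                    return Some(block);
                }
            }
        }
        None
    }
}
```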
> I suggest focusing primarily on our use case, where Asset Hub sets up a scheduler with a Relay Chain block provider.

But even in that case the thing I mentioned can happen. There just is no guarantee as to how much history AH needs to check per block. So we either bound it with a queue or we do not. Maybe option 2) from Gui is a reasonable middle ground, but it still removes the property of the scheduler that every scheduled agenda will eventually be serviced, since it could miss some.

If we only want to change it for AHM, then maybe forking the pallet into the runtimes repo would work; otherwise we may screw over parachain teams by cutting corners here.
> Maybe option 2) from Gui is a reasonable middle ground, but it still removes the property of the scheduler that every scheduled agenda will eventually be serviced, since it could miss some.

I missed this point, what can be missed?

I agree we should remove MaxScheduledBlocks and MaxStaleTaskAge, and require in the pallet description that performance can significantly decrease if the chain's blocks are too far apart.

As a note, I think a priority queue on top of chunks should not be so hard to implement, and is probably the best solution.
substrate/frame/scheduler/src/lib.rs
Outdated
```rust
#[pallet::storage]
pub type IncompleteSince<T: Config> = StorageValue<_, BlockNumberFor<T>>;

/// Provider for the block number. Normally this is the `frame_system` pallet.
```
Normally in what case? Parachain or relay/solo?
substrate/frame/scheduler/src/lib.rs
Outdated
```rust
/// Provider for the block number. Normally this is the `frame_system` pallet.
type BlockNumberProvider: BlockNumberProvider;

/// The maximum number of blocks that can be scheduled.
```
Any hints on how to configure this? Parachain teams will read this and not know what number to put.
substrate/frame/scheduler/src/lib.rs
Outdated
```rust
#[pallet::constant]
type MaxScheduledBlocks: Get<u32>;

/// The maximum number of blocks that a task can be stale for.
```
Also maybe a hint for a sane default value.
substrate/frame/scheduler/src/lib.rs
Outdated
```rust
/// The queue of block numbers that have scheduled agendas.
#[pallet::storage]
pub(crate) type Queue<T: Config> =
	StorageValue<_, BoundedVec<BlockNumberFor<T>, T::MaxScheduledBlocks>, ValueQuery>;
```
Do we know if one vector is enough? I think the referenda pallet creates an alarm for each ref...
Not sure if I get this, can you elaborate more?
Is it okay if I convert it to a vector of vectors?
A vector of vectors would be the same as a vector in regard to PoV. I think if we need to schedule a lot we will have to use multiple storage items.
For our use case (Governance) it is probably fine to have just one vector, I guess, since we only put a single block number in there. So we could reasonably have 1000 in there or so. But yes, if someone uses it for other things then it could cause high PoV on a parachain.
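For scale, assuming u32 block numbers (an assumption, not something this PR states): 1000 entries encode to roughly 4 KB, so a single bounded vector stays cheap for the governance case and only becomes a PoV concern with much heavier usage.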
@@ -1157,24 +1181,30 @@ impl<T: Config> Pallet<T> {
```rust
			return
		}

		let mut incomplete_since = now + One::one();
		let mut when = IncompleteSince::<T>::take().unwrap_or(now);
```
Yes, the referenda pallet creates an alarm for every ref to check the voting turnout.

We have a problem with the (3.2) case only. In the current version (without the Queue) it will eventually handle the overdue blocks (we can even calculate how many blocks it will take, say if there are no tasks scheduled in that period).

Depends on how many blocks are produced. I guess if we assume that the parachain will produce blocks at least as fast as it can advance the scheduler, then yes.

> Playing devil's advocate here: there could be parachains that only produce one block every two hours, which would get stuck without ever catching up the IncompleteSince.

Conceptually, I believe that a priority queue is the right data structure. We try to evaluate an ordered list of tasks by their order; that is exactly what a priority queue is good at. The issue with implementing this as a vector is obviously the PoV.

Maybe we can implement the Queue as a B-tree? Then we can get the next task in log reads and insert in log writes. And it allows us to do exactly what we want: get the next pending task. It could be PoV-optimized by using chunks as well.

To me it just seems that most of the pain here is that we are using the wrong data structure for the job.
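As a sketch of the access pattern a B-tree would give (using std's BTreeMap in place of whatever storage-backed structure would actually be used):

```rust
use std::collections::BTreeMap;

type BlockNumber = u32;
type Agenda = Vec<&'static str>; // stand-in for the real agenda type

/// Scheduling is an O(log n) insert keyed by the target block.
fn schedule(queue: &mut BTreeMap<BlockNumber, Agenda>, when: BlockNumber, task: &'static str) {
    queue.entry(when).or_default().push(task);
}

/// The earliest due entry is the first key <= now: O(log n) to find.
fn next_pending(
    queue: &BTreeMap<BlockNumber, Agenda>,
    now: BlockNumber,
) -> Option<(&BlockNumber, &Agenda)> {
    queue.range(..=now).next()
}
```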
What is the complexity impact of this change?

I would rather leave the current scheduler untouched and add a new one that supports the relay chain block number or timestamp. They can share some helper functions if needed.
```rust
/// pallet.
type BlockNumberProvider: BlockNumberProvider;

/// The maximum number of blocks that can be scheduled. Should be equal to 50 if
```
Why does it have to be 50? And that is a very low number.
It is a recommendation; maybe I can phrase it better. Also, it can be 200 if runtime-benchmarks are enabled.
Where is the 50/200 coming from? How will a parachain determine what value to use? What happens if it is too high or too low? There needs to be some guidance.
Yeah, you are probably leaking how a runtime like Westend or Rococo implements this, but it doesn't need to be mentioned here. Any runtime can decide to provide the value that makes sense to it.

What you should define in the trait is what this parameter means, that's all.
The scheduler is only used by governance; if you use it for more than just that, you would probably want to raise it.

But I also don't think putting concrete numbers in the docs makes sense. Anyone can pick the values they want; these are just sane defaults.
The documentation has to explain how to determine an appropriate value and what will go wrong if the value is too high or too low. Otherwise, as a parachain developer using the scheduler for some other tasks, what number should I use? Does a high number make it safer, or a low number? There are not enough details to help devs make a decision.
```rust
#[pallet::constant]
type MaxScheduledBlocks: Get<u32>;

/// The maximum number of blocks that a task can be stale for. Should be equal to 10 if
```
It doesn't make sense to require a configurable value to be a fixed value.
substrate/frame/scheduler/src/lib.rs
Outdated
```rust
let mut iter = queue.iter();
let mut to_remove = Vec::new(); // Collect items to remove

// Iterate and collect items to remove
for _ in 0..index {
	iter.next();
}
for item in iter {
	to_remove.push(*item);
}

// Now remove the collected items
for item in to_remove {
	queue.remove(&item);
}
```
Not sure what is going on here, but maybe something like `queue.iter().drain().take(index).map(drop);` would work.
That gave me an error:

```
error[E0599]: no method named `drain` found for mutable reference `&mut BTreeSet<<... as BlockNumberProvider>::BlockNumber>` in the current scope
```
If the interface for the bounded vec is too annoying, you can convert to a Vec and convert back, because the code is difficult to read.
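If the collection really is a BTreeSet, one idiomatic option is split_off, sketched here under the assumption that the goal is to discard everything on one side of a cutoff key:

```rust
use std::collections::BTreeSet;

/// Drop all entries below `cutoff`, keeping only the not-yet-due ones.
fn prune_below(queue: &mut BTreeSet<u32>, cutoff: u32) {
    // `split_off` leaves keys < cutoff in `queue` and returns keys >= cutoff;
    // reassigning keeps the returned half and drops the processed half.
    *queue = queue.split_off(&cutoff);
}
```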
substrate/frame/scheduler/src/lib.rs
Outdated
```rust
incomplete_since = incomplete_since.min(when);
let mut index = 0;
let queue_len = queue.len();
for when in queue.iter().skip(index) {
```
You can store the iterator and make it mutable to call next on it. Something like:

```rust
while let Some(value) = iter.next() {
```
The error was:

```
error[E0608]: cannot index into a value of type `BTreeSet<<<T as pallet::Config>::BlockNumberProvider as sp_runtime::traits::BlockNumberProvider>::BlockNumber>`
```

But I will try something similar again and see if that works.
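A minimal sketch of the suggested pattern over a BTreeSet, which has no indexing (the servicing body is elided):

```rust
use std::collections::BTreeSet;

fn service_due(queue: &BTreeSet<u32>, now: u32) {
    // Hold the iterator and advance it explicitly instead of indexing.
    let mut iter = queue.iter();
    while let Some(&when) = iter.next() {
        if when >= now {
            break; // iteration is ordered, so nothing later is due either
        }
        // ... service the agenda for block `when` here ...
    }
}
```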
substrate/frame/scheduler/src/lib.rs
Outdated
```rust
let mut when = IncompleteSince::<T>::take().unwrap_or(now);
let mut executed = 0;
let queue = Queue::<T>::get();
let end_index = match queue.iter().position(|&x| x >= now) {
```
Why is this needed? There are only two cases: Either the first element is >= now or it is not.
We can just check the first element in the loop below and then take it from there.
substrate/frame/scheduler/src/lib.rs
Outdated
```rust
		return;
	},
};
if end_index == 0 {
```
The +1 above prevents this from ever happening.
substrate/frame/scheduler/src/lib.rs
Outdated
```rust
let queue_len = queue.len();
for when in queue.iter().skip(index) {
	if *when < now.saturating_sub(T::MaxStaleTaskAge::get()) {
		Agenda::<T>::remove(*when);
```
Why do we need this MaxStaleTaskAge? I think it should maybe rather be checked on insertion into an agenda instead of when servicing it, since in the service step we have nothing to lose from executing it. But I am not sure if we need it at all.
I am a bit torn here. I often think that just forking the pallets would make our lives easier, since it is not a breaking change anymore... but it could mean increased maintenance. @seadanda @kianenigma any opinion?
@@ -32,7 +33,10 @@ type SystemCall<T> = frame_system::Call<T>;
```rust
type SystemOrigin<T> = <T as frame_system::Config>::RuntimeOrigin;

const SEED: u32 = 0;
const BLOCK_NUMBER: u32 = 2;

fn block_number<T: Config>() -> u32 {
```
Naming it max_scheduled_blocks sounds more appropriate.
From this thread: #6362 (comment)

I think this structure shouldn't add too much complexity and will solve most parachain use cases. WDYT?
With async backing/elastic scaling, we break the assumption made in the scheduler and elsewhere that the parachain block number represents a regular monotonic clock which can be used as a proxy for a wall clock. IMO the pallet should have had this feature from the start, but since this is the contract we released under, I understand why making changes which decrease the benchmarked performance seems like a regression. In terms of functionality, nothing changes for people who keep the block provider pointed at the system pallet.

Maybe we could introduce a feature flag which removes the extra data structure and introduced complexity; not sure how disgusting that would be. Since the benchmarking depends on how it's configured, are we even sure there will be a weight impact if it's configured to use the system pallet's block number?
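For concreteness, a sketch of the two provider configurations being discussed; this is illustrative wiring, not code from this PR, and every other Config item is elided:

```rust
// Hypothetical parachain runtime configuration.
impl pallet_scheduler::Config for Runtime {
    // ... all other associated types elided ...

    // On a parachain, the relay chain supplies the clock:
    type BlockNumberProvider =
        cumulus_pallet_parachain_system::RelaychainDataProvider<Runtime>;
    // On a solo or relay chain, the system pallet remains the provider:
    // type BlockNumberProvider = frame_system::Pallet<Runtime>;
}
```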
All GitHub workflows were cancelled due to the failure of one of the required jobs.
No feature flag plz, it is guaranteed to introduce disaster. What are the downsides of making a new pallet? I see none.
Step in #6297.

This PR adds the ability for the Scheduler pallet to specify its source of the block number. This is needed for the scheduler pallet to work on a parachain which does not produce blocks on a regular schedule and can thus use the relay chain as a block provider. Because blocks are not produced regularly, we cannot assume that the block number increases monotonically, so there is new logic via a Queue to handle multiple blocks with valid agendas passing between them.

This change only needs a migration for the Queue, whether the BlockNumberProvider continues to use the system pallet's block number or the relay chain's block number. However, we would need separate migrations if the deployed pallets are upgraded on an existing parachain and the BlockNumberProvider uses the relay chain block number.

Todo