Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding optional arg to BF.INSERT to allow users to check if their bloom filter can reach the desired size #41

Merged
merged 2 commits into from
Jan 30, 2025

Conversation

zackcam
Copy link
Contributor

@zackcam zackcam commented Jan 17, 2025

Overview

This PR is proposing to add an extra optional arg to the BF.INSERT command in order to allow users to know if their creation will be able to scale to what they desire. We will add an optional arg called “VALIDATESCALETO” and validate whether it is possible to achieve X capacity given the bloom object memory usage limit (by default 128MB), fp rate, Tightening, expansion, and scale outs. We also will expand BF.INFO as well to have a new output as well. This output will show what the max capacity of the specified bloom filter is. This field can also be received independently by specifying MAXSCALEDCAPACITY as an arg.

Amending BF.INSERT command

Adding VALIDATESCALETO as an optional arg of BF.INSERT. This argument will allow the user to specify what they think their filter will scale up to. If the capacity that they think it will scale to isn't possible given the specifications provided then we will throw an error. New bf.insert command will look like:

BF.INSERT <key> [CAPACITY capacity] [ERROR fp_error] [EXPANSION expansion] [NOCREATE] [NONSCALING] [VALIDATESCALETO validate_scale_to] [ITEMS item]

Due to the nature of this optional argument, specifying NONSCALING and ATLEASTCAPACITY isn't allowed as this new argument is only useful for seeing if a filter can scale to the desired capacity.
There are three new error messages associated with this:

pub const VALIDATE_SCALE_TO_EXCEEDS_MAX_SIZE: &str =
    "ERR provided VALIDATESCALETO causes bloom object to exceed memory limit";
pub const VALIDATE_SCALE_TO_FALSE_POSITIVE_INVALID: &str =
    "ERR provided VALIDATESCALETO causes false positive to degrade to 0";
pub const NON_SCALING_AND_VALIDATE_SCALE_TO_IS_INVALID: &str =
    "ERR cannot use NONSCALING and VALIDATESCALETO options together";

Amending BF.INFO command

We will add a new return value to the default result from BF.INFO on a default key, the new output will look like:

'Capacity', 100, 'Size', 384, 'Number of filters', 1, 'Number of items inserted', 1, 'Error rate', b'0.01', 'Expansion rate', 2, 'Max scaled capacity', 26214300]

There is also a new optional arg (MAXSCALEDCAPACITY) which can be specified so running `BF.INFO key MAXSCALEDCAPACITY' will return xxxx. Where xxxx is the MAXSCALEDCAPACITY
As this information is only useful for scaling filters this option will be hidden non scaling filters this means that if the user has specified NONSCALING in the creation of their filter we will have not have the Max scaled capacity in the return. We also will not recognize MAXSCALEDCAPACITY as a valid optional arg to the info command in such a case as well.

Testing

Integration Tests
Added checks for these three error messages in test_bloom_command.py as well as happy cases to show that different expansion, fp rates and capacity all are valid when using this new optional arg.

Updated the correctness tests to now check for the new field for a scaling filter when it does a BF.INFO call. For the non scaling correctness test we check that the new field does not appear

Added a more in depth correctness test as well where we initially try to create two filters where the VALIDATESCALETO is 1 more than the MAXSCALEDCAPACITY (one fails due to fp_rate degradation, one exceeds memory limit). We then actually create it with VALIDATESCALETO set to the MAXSCALEDCAPACITY of the two keys. We then do an info call and get the returned MAXSCALEDCAPACITY and add that many items to both bloom filters. From there we then add 1 more item and expect a failure with either a exceeded memory error or fp rate reaches 0 error. At the very end we then compare a new result of a BF.INFO call to make sure that MAXSCALEDCAPACITY has not changed after adding items and scaling out.

Unit tests:
Added a unit tests for the calculate_max_scaled_capacity method with 5 different variations. For each variation it does a happy check where we will not receive an error from the command. We check that we get the correct returned capacity it reached after it went past the desired 'validate_scale_to'. Next we do an error case for each, four of them we expect to fail due to memory limit and one we expect to fail on fp_rate.

Also expanded test_seed to double check that random seed is different each time.

Apendix

Also suggested in the beginning was a second option: For option 2 we will add a new command that would output the capacity that can be reached before having an error thrown from scaling out given the bloom object memory usage limit (by default 128MB), fp rate, Tightening, expansion.

Option 2: Create a new command BF.GETMAXCAPACITY

This would check what the max capacity of a bloom filter is given a capacity, error rate and expansion. This command will look like:

BF.GETMAXCAPACITY [CAPACITY capacity] [ERROR fp_error] [EXPANSION expansion]

This command would then output the capacity that could be reached before having an error thrown from scaling out. This would not create a filter given the specified arguments.

@zackcam zackcam force-pushed the unstable branch 6 times, most recently from e2b5632 to 94bdd4c Compare January 21, 2025 19:20
…om filter can reach the desired size

Signed-off-by: zackcam <[email protected]>
@madolson
Copy link
Member

All things considered, I think the current naming and API arguments make sense. The only alternative I thought of is we can add it BF.INFO, which might make sense either way. We don't need a new command, and diverging on info should be OK since it's written to be extensible.

@zackcam
Copy link
Contributor Author

zackcam commented Jan 22, 2025

All things considered, I think the current naming and API arguments make sense. The only alternative I thought of is we can add it BF.INFO, which might make sense either way. We don't need a new command, and diverging on info should be OK since it's written to be extensible.

Would we want to add to both BF.INFO and BF.INSERT?
I think that adding to both would make the most sense as adding to BF.INSERT can allow start up confirmation of the size they want is possible. While adding to BF.INFO can provide the max capacity information at any point while also showing how big it could get not just if it can get bigger than what the user wanted.

src/bloom/utils.rs Outdated Show resolved Hide resolved
src/bloom/utils.rs Outdated Show resolved Hide resolved
src/bloom/utils.rs Outdated Show resolved Hide resolved
src/bloom/utils.rs Outdated Show resolved Hide resolved
@zackcam zackcam force-pushed the unstable branch 2 times, most recently from 6904b88 to 2b79825 Compare January 28, 2025 22:21
src/bloom/utils.rs Outdated Show resolved Hide resolved
src/bloom/utils.rs Outdated Show resolved Hide resolved
tests/test_bloom_correctness.py Outdated Show resolved Hide resolved
src/bloom/command_handler.rs Outdated Show resolved Hide resolved
src/bloom/command_handler.rs Show resolved Hide resolved
src/bloom/command_handler.rs Outdated Show resolved Hide resolved
src/bloom/command_handler.rs Show resolved Hide resolved
src/bloom/command_handler.rs Outdated Show resolved Hide resolved
src/bloom/command_handler.rs Outdated Show resolved Hide resolved
src/bloom/utils.rs Outdated Show resolved Hide resolved
src/bloom/utils.rs Outdated Show resolved Hide resolved
src/bloom/utils.rs Outdated Show resolved Hide resolved
@zackcam zackcam force-pushed the unstable branch 2 times, most recently from a2e007b to 96794de Compare January 29, 2025 19:06
src/bloom/utils.rs Outdated Show resolved Hide resolved
src/bloom/utils.rs Outdated Show resolved Hide resolved
src/bloom/utils.rs Show resolved Hide resolved
src/bloom/utils.rs Outdated Show resolved Hide resolved
@zackcam zackcam force-pushed the unstable branch 2 times, most recently from 1ecb04b to 82fea4a Compare January 29, 2025 21:09
@@ -462,6 +462,7 @@ pub fn bloom_filter_insert(ctx: &Context, input_args: &[ValkeyString]) -> Valkey
true => (None, true),
false => (Some(configs::FIXED_SEED), false),
};
let mut validate_scale_to = -1;
Copy link
Member

@KarthikSubbarao KarthikSubbarao Jan 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Would prefer using None here to make this code more Rust like.

let mut validate_scale_to = None;

Later on, you can use this:

validate_scale_to = match input_args[idx].to_string_lossy().parse::<i64>() {
                    Ok(num) if (BLOOM_CAPACITY_MIN..=BLOOM_CAPACITY_MAX).contains(&num) => Some(num),

And then:

if let Some(validate_scale_to) {

}

@KarthikSubbarao
Copy link
Member

Thank you for the change @zackcam . Merging it in now

@KarthikSubbarao KarthikSubbarao merged commit 14ad3d0 into valkey-io:unstable Jan 30, 2025
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants