Mixer validator #215

Merged: 24 commits, Oct 21, 2024

mariia-iureva
Contributor

This validator checks that the Mixer job configuration is correct, that data is properly aligned, and that filters are valid before the actual Mixer job starts.

Configuration Validation

  • Loads and validates the configuration file structure
  • Checks for required fields and correct data types in the config
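
A minimal sketch of the structural check described above, assuming the config has already been parsed into a dict. The field names (`streams`, `processes`, `name`, `documents`, `output`) and both schema tables are illustrative assumptions, not taken from this PR.

```python
from typing import Any

# Hypothetical schema: field name -> expected type. Not taken from the PR.
REQUIRED_TOP_LEVEL = {"streams": list, "processes": int}
REQUIRED_STREAM_FIELDS = {"name": str, "documents": list, "output": dict}


def check_fields(obj: dict[str, Any], required: dict[str, type], where: str) -> list[str]:
    """Return one error message per missing or wrongly typed field."""
    errors = []
    for field, expected in required.items():
        if field not in obj:
            errors.append(f"{where}: missing required field '{field}'")
        elif not isinstance(obj[field], expected):
            errors.append(
                f"{where}: '{field}' should be {expected.__name__}, "
                f"got {type(obj[field]).__name__}"
            )
    return errors


def validate_config(config: dict[str, Any]) -> list[str]:
    """Collect all structural errors instead of stopping at the first one."""
    errors = check_fields(config, REQUIRED_TOP_LEVEL, "config")
    streams = config.get("streams")
    if isinstance(streams, list):
        for i, stream in enumerate(streams):
            if not isinstance(stream, dict):
                errors.append(f"streams[{i}]: expected a mapping")
                continue
            errors.extend(check_fields(stream, REQUIRED_STREAM_FIELDS, f"streams[{i}]"))
    return errors
```

Returning a list of messages, rather than raising on the first problem, lets the validator report every config issue in one run.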

S3 Path and Permission Validation

  • Validates S3 paths for documents, attributes, and output
  • Checks permissions (read/write) for the specified S3 paths
  • Verifies the existence of parent directories for output paths
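
The path-format part of this step can be sketched without touching the network. The helper names `split_s3_uri` and `parent_prefix` are hypothetical; the actual permission and existence checks would need an S3 client and are not shown here.

```python
def split_s3_uri(uri: str) -> tuple[str, str]:
    """Split 's3://bucket/key' into (bucket, key); raise ValueError if malformed."""
    prefix = "s3://"
    if not uri.startswith(prefix):
        raise ValueError(f"not an S3 URI: {uri!r}")
    bucket, _, key = uri[len(prefix):].partition("/")
    if not bucket:
        raise ValueError(f"S3 URI has no bucket: {uri!r}")
    return bucket, key


def parent_prefix(key: str) -> str:
    """Return the 'directory' part of a key with a trailing slash ('' at the bucket root)."""
    head, sep, _ = key.rstrip("/").rpartition("/")
    return head + sep
```

For an output path, the prefix returned by `parent_prefix` is what a validator would look up to confirm the parent "directory" exists.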

Stream Filter Validation

  • Validates filter expressions in the configuration
  • Checks for syntax errors and provides warnings for potential issues
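
One cheap syntax check that applies to most filter languages is bracket balancing. The sketch below assumes filters are bracketed string expressions (for example JSONPath-style); the PR does not state the filter syntax, and `check_filter_syntax` is a hypothetical name. It does not account for brackets inside quoted strings.

```python
# Closing bracket -> matching opening bracket.
PAIRS = {")": "(", "]": "[", "}": "{"}


def check_filter_syntax(expr: str) -> list[str]:
    """Return a list of problems found in a filter expression (empty if none)."""
    if not expr.strip():
        return ["filter expression is empty"]
    problems = []
    stack = []  # (opening bracket, position) pairs still waiting to be closed
    for pos, ch in enumerate(expr):
        if ch in PAIRS.values():
            stack.append((ch, pos))
        elif ch in PAIRS:
            if not stack or stack[-1][0] != PAIRS[ch]:
                problems.append(f"unexpected '{ch}' at position {pos}")
                break
            stack.pop()
    else:
        # Loop finished without a mismatch: anything left on the stack is unclosed.
        problems.extend(f"unclosed '{ch}' opened at position {pos}" for ch, pos in stack)
    return problems
```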

Document and Attribute Alignment

  • Samples a specified number of documents and their corresponding attribute files
  • Validates the alignment between document and attribute files:
    - Checks that document and attribute files have the same number of lines
    - Verifies that both document and attribute files are valid JSONL
    - Ensures required fields are present in both document and attribute files
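
The three alignment checks above can be sketched over lines that have already been read into memory. The required field names in `DOC_FIELDS` and `ATTR_FIELDS` are assumptions; the PR does not list them.

```python
import json
from typing import Sequence

# Hypothetical required keys; the PR does not list them.
DOC_FIELDS = ("id", "text")
ATTR_FIELDS = ("id", "attributes")


def check_jsonl(lines: Sequence[str], required: Sequence[str], label: str) -> list[str]:
    """Check that every line is a JSON object containing the required fields."""
    errors = []
    for n, line in enumerate(lines, start=1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            errors.append(f"{label} line {n}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(record, dict):
            errors.append(f"{label} line {n}: expected a JSON object")
            continue
        missing = [field for field in required if field not in record]
        if missing:
            errors.append(f"{label} line {n}: missing fields {missing}")
    return errors


def check_alignment(doc_lines: Sequence[str], attr_lines: Sequence[str]) -> list[str]:
    """Compare line counts, then validate both files as JSONL."""
    errors = []
    if len(doc_lines) != len(attr_lines):
        errors.append(
            f"line count mismatch: {len(doc_lines)} documents vs {len(attr_lines)} attributes"
        )
    errors += check_jsonl(doc_lines, DOC_FIELDS, "document")
    errors += check_jsonl(attr_lines, ATTR_FIELDS, "attribute")
    return errors
```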

Filter Execution Simulation

  • Executes filter commands on a sample of attribute files
  • Reports on the number of lines that pass or are excluded by the filters
  • Identifies any errors encountered during filter execution
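
A rough sketch of the pass/exclude/error tally, using a plain Python callable as a stand-in for a real filter command. The PR does not show how filters are executed, so `simulate_filter` and the predicate interface are assumptions.

```python
from collections import Counter
from typing import Any, Callable, Iterable


def simulate_filter(
    records: Iterable[dict[str, Any]],
    predicate: Callable[[dict[str, Any]], bool],
) -> Counter:
    """Count how many records pass, are excluded, or make the filter fail."""
    counts = Counter(passed=0, excluded=0, errors=0)
    for record in records:
        try:
            outcome = "passed" if predicate(record) else "excluded"
        except (KeyError, IndexError, TypeError):
            # A missing or malformed attribute is reported, not fatal.
            outcome = "errors"
        counts[outcome] += 1
    return counts
```

Counting errors separately keeps a misspelled attribute from being silently reported as "excluded".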

Attribute Name Validation

  • Cross-references attribute names used in filters with those found in attribute files
  • Identifies missing attributes and potential typos in attribute names
  • Provides a comprehensive list of all attributes found in the files
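
Typo detection of this kind can be sketched with the standard library's `difflib`. The function name, the cutoff of 0.8, and the example attribute names are assumptions for illustration, not values from the PR.

```python
import difflib


def check_attribute_names(used: set[str], available: set[str]) -> dict[str, list[str]]:
    """Map each attribute that is used in a filter but not found to its close matches."""
    candidates = sorted(available)
    return {
        name: difflib.get_close_matches(name, candidates, n=3, cutoff=0.8)
        for name in sorted(used - available)
    }
```

An empty suggestion list means the attribute is simply missing; a non-empty one points to a likely typo.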

File Sampling and Analysis

  • Downloads and analyzes a sample of files from the specified S3 paths
  • Counts lines in both document and attribute files
  • Validates the structure and content of sampled files
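
A small sketch of the sampling and line-counting steps, assuming the files have already been listed and downloaded locally (the S3 download itself is not shown). `sample_paths` and `count_lines` are hypothetical helper names, and the seeded sampling is a design choice made here for reproducibility.

```python
import gzip
import random


def sample_paths(paths: list[str], k: int, seed: int = 0) -> list[str]:
    """Pick up to k paths reproducibly: the same seed gives the same sample."""
    rng = random.Random(seed)
    return rng.sample(sorted(paths), min(k, len(paths)))


def count_lines(path: str) -> int:
    """Count lines in a plain or gzip-compressed text file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as handle:
        return sum(1 for _ in handle)
```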

Reporting and Logging

  • Provides detailed console output for each validation step
  • Warns about potential issues or misconfigurations
  • Summarizes the results of each validation phase

Error Handling and Cleanup

  • Handles errors gracefully and provides informative error messages
  • Cleans up temporary files and directories after validation
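
One way to get both behaviours at once is to run each step inside a `tempfile.TemporaryDirectory` context. This is a sketch under that assumption; `run_with_scratch_dir` is a hypothetical helper, not the PR's code.

```python
import tempfile
from pathlib import Path
from typing import Callable


def run_with_scratch_dir(step: Callable[[Path], object]) -> tuple[bool, str]:
    """Run one validation step in a temporary directory that is always removed."""
    with tempfile.TemporaryDirectory(prefix="mixer-validate-") as tmp:
        try:
            step(Path(tmp))
        except Exception as exc:
            # Report the failure as a message instead of crashing the whole run.
            return False, f"{type(exc).__name__}: {exc}"
    return True, "ok"
```

The directory is deleted when the `with` block exits, whether the step succeeded or raised.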

@Whattabatt
Contributor

You're going to need to add this file's dependencies to pyproject.toml in order for it to run in a clean environment

@mariia-iureva
Contributor Author

> You're going to need to add this file's dependencies to pyproject.toml in order for it to run in a clean environment

Addressed this one and added dependencies

@Whattabatt
Contributor

The warnings produce a lot of noise. It'd be good to accept a 'verbose' flag and only log the warnings if it's set to true.

scripts/validate_mixer.py (outdated)
print(f"File path type: {type(file_path)}")
return None

def evaluate_comparison(value, op, comparison_value):

What about `!=`? Or any of the other, less common operators listed at https://www.w3schools.com/python/python_operators.asp?
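
The body of `evaluate_comparison` is not shown in this excerpt. One way to cover `!=` and the remaining comparison operators is a lookup table built on the standard `operator` module; the sketch below keeps the signature from the diff, but the `OPERATORS` table and the error handling are assumptions, not the PR's implementation.

```python
import operator

# Comparison operators written as strings, mapped to their functions.
OPERATORS = {
    "==": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
}


def evaluate_comparison(value, op, comparison_value):
    """Apply a comparison given as a string; reject unknown operators loudly."""
    try:
        compare = OPERATORS[op]
    except KeyError:
        raise ValueError(f"unsupported operator: {op!r}") from None
    return compare(value, comparison_value)
```

Raising on an unknown operator, rather than returning `None`, makes an unsupported filter visible during validation.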

@mariia-iureva
Copy link
Contributor Author

> The warnings produce a lot of noise, it'd be good to accept a 'verbose' flag and only log the warnings if it's set to true.

Added a --verbose flag and put most of the print statements behind it
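
For illustration, a flag like this can be wired up with `argparse`. The PR gates print statements; the sketch below swaps in the standard `logging` module instead, and the positional `config` argument, the logger name, and the `configure` helper are assumptions.

```python
import argparse
import logging


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Validate a Mixer config (sketch).")
    parser.add_argument("config", help="path to the mixer config file (hypothetical argument)")
    parser.add_argument("--verbose", action="store_true", help="also show warnings")
    return parser


def configure(argv: list[str]) -> logging.Logger:
    """Parse arguments and return a logger that shows warnings only with --verbose."""
    args = build_parser().parse_args(argv)
    logger = logging.getLogger("validate_mixer")
    # Errors are always shown; warnings only in verbose mode.
    logger.setLevel(logging.WARNING if args.verbose else logging.ERROR)
    return logger
```

With this setup, `logger.warning(...)` calls stay in the code but produce output only when `--verbose` is passed.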

@Whattabatt Whattabatt left a comment

You should delete the old validate_mixer.py script but otherwise looks good!

@mariia-iureva mariia-iureva merged commit 0c0f10c into main Oct 21, 2024
18 checks passed
@mariia-iureva mariia-iureva deleted the mixer-validator branch October 21, 2024 22:42