
Commit 3a9c5c2

Merge pull request #260 from posit-dev/feat-set-tbl
feat: Add the `set_tbl()` method and add `set_tbl=` to `yaml_interrogate()`
2 parents 74704e9 + 0177f64 commit 3a9c5c2

File tree

8 files changed (+1502, -22 lines)


docs/_quarto.yml

Lines changed: 1 addition & 0 deletions

@@ -194,6 +194,7 @@ quartodoc:
       can split the data based on the validation results (with `get_sundered_data()`).
     contents:
       - name: Validate.interrogate
+      - name: Validate.set_tbl
       - name: Validate.get_tabular_report
       - name: Validate.get_step_report
       - name: Validate.get_json_report

docs/user-guide/yaml-reference.qmd

Lines changed: 36 additions & 1 deletion

@@ -56,6 +56,40 @@ tbl:
     pl.scan_csv("data.csv").filter(pl.col("date") >= "2024-01-01")
 ```
 
+#### Using Templates with `set_tbl=`
+
+For reusable validation templates that will always use a custom data source via the `set_tbl=`
+parameter in `yaml_interrogate()`, the `tbl` field is still required but its value doesn't matter
+since it will be overridden. Recommended approaches:
+
+```yaml
+# Option 1: Use a valid dataset name (gets overridden anyway)
+tbl: small_table  # Will be ignored when `set_tbl=` is used
+
+# Option 2: Use YAML null (clearest semantic intent)
+tbl: null  # Indicates table will be provided via `set_tbl=`
+```
+
+When using `yaml_interrogate()` with `set_tbl=`, the validation template becomes fully reusable:
+
+```python
+# Define reusable template
+template = """
+tbl: null  # Will be overridden
+tbl_name: "Sales Validation"
+steps:
+- col_exists:
+    columns: [customer_id, revenue, region]
+- col_vals_gt:
+    columns: [revenue]
+    value: 0
+"""
+
+# Apply to different datasets
+q1_result = pb.yaml_interrogate(template, set_tbl=q1_data)
+q2_result = pb.yaml_interrogate(template, set_tbl=q2_data)
+```
+
 ### DataFrame Library (`df_library`)
 
 The `df_library` key controls which DataFrame library is used to load data sources. This parameter

@@ -117,7 +151,7 @@ thresholds:
   critical: 0.15  # 15% failure rate triggers critical
 ```
 
-- values: numbers between 0 and 1 (percentages) or integers (row counts)
+- values: numbers between `0` and `1` (percentages) or integers (row counts)
 - levels: `warning`, `error`, `critical`
 
 ### Global Actions

@@ -477,6 +511,7 @@ For Pandas DataFrames (when using `df_library: pandas`):
 ```yaml
 - specially:
     expr: "lambda df: df.assign(is_valid=df['a'] + df['d'] > 0)"
+```
 
 ## Column Selection Patterns
 

docs/user-guide/yaml-validation-workflows.qmd

Lines changed: 130 additions & 0 deletions

@@ -167,6 +167,136 @@ tbl:
   )
 ```
 
+## Reusable Templates with `set_tbl=`
+
+One of the most powerful features of YAML validation workflows is the ability to create reusable
+templates that can be applied to different datasets. Using the `set_tbl=` parameter with
+`yaml_interrogate()`, you can define validation logic once and apply it to multiple data sources.
+
+### Creating Validation Templates
+
+When creating templates for use with `set_tbl=`, the `tbl` field is still required but its value
+will be overridden. The recommended approach is to use `tbl: null`:
+
+```yaml
+tbl: null
+tbl_name: "Sales Data Validation Template"
+label: "Standard validation checks for sales data"
+steps:
+- col_exists:
+    columns: [customer_id, revenue, region, date]
+- col_vals_not_null:
+    columns: [customer_id, revenue]
+- col_vals_gt:
+    columns: [revenue]
+    value: 0
+- col_vals_in_set:
+    columns: [region]
+    set: [North, South, East, West]
+```
+
+### Applying Templates to Multiple Datasets
+
+Here's a practical example showing how to apply the same validation template to multiple quarterly
+datasets, demonstrating the power of reusable YAML configurations:
+
+```{python}
+import pointblank as pb
+import polars as pl
+
+# Define the template once
+sales_template = """
+tbl: null  # Will be overridden
+tbl_name: "Sales Data Validation"
+label: "Standard sales validation checks"
+thresholds:
+  warning: 0.05
+  error: 0.1
+steps:
+- col_exists:
+    columns: [customer_id, revenue, region]
+- col_vals_not_null:
+    columns: [customer_id, revenue]
+- col_vals_gt:
+    columns: [revenue]
+    value: 0
+- col_vals_in_set:
+    columns: [region]
+    set: [North, South, East, West]
+"""
+
+# Create different datasets
+q1_data = pl.DataFrame({
+    "customer_id": [1, 2, 3, 4],
+    "revenue": [100, 200, 150, 300],
+    "region": ["North", "South", "East", "West"]
+})
+
+q2_data = pl.DataFrame({
+    "customer_id": [5, 6, 7, 8],
+    "revenue": [250, 180, 220, 350],
+    "region": ["South", "North", "West", "East"]
+})
+
+# Apply the same template to both datasets
+q1_result = pb.yaml_interrogate(sales_template, set_tbl=q1_data)
+q2_result = pb.yaml_interrogate(sales_template, set_tbl=q2_data)
+
+print(f"Q1 validation: {all(v.all_passed for v in q1_result.validation_info)}")
+print(f"Q2 validation: {all(v.all_passed for v in q2_result.validation_info)}")
+```
+
+### Template Best Practices
+
+1. **Use `tbl: null`**: this clearly indicates the template expects a data source to be provided
+2. **Include comprehensive metadata**: use `tbl_name`, `label`, and `brief` to make results
+   self-documenting
+3. **Set appropriate thresholds**: define warning/error levels that make sense for your use case
+4. **Version control templates**: store templates in your repository alongside your data
+   processing code
+5. **Test with sample data**: validate your templates work with representative datasets
+
+### Common Template Patterns
+
+For API response validation, you can ensure that responses have the expected structure and valid
+status codes:
+
+```yaml
+tbl: null
+tbl_name: "API Response Validation"
+brief: "Standard checks for API response data"
+steps:
+- col_exists:
+    columns: [user_id, status, timestamp]
+- col_vals_in_set:
+    columns: [status]
+    set: [success, error, pending]
+- col_vals_not_null:
+    columns: [user_id, timestamp]
+```
+
+For file upload validation, you can check file sizes and formats to ensure they meet your
+requirements:
+
+```yaml
+tbl: null
+tbl_name: "File Upload Validation"
+steps:
+- col_vals_gt:
+    columns: [file_size]
+    value: 0
+- col_vals_lt:
+    columns: [file_size]
+    value: 10485760  # 10MB limit
+- col_vals_in_set:
+    columns: [file_type]
+    set: [csv, json, xlsx, parquet]
+```
+
+This template approach is particularly valuable in data pipelines, ETL processes, and automated
+testing scenarios where you need to apply consistent validation logic across multiple similar
+datasets.
+
 ## Validation Steps
 
 YAML supports all of Pointblank's validation methods. Here are some common patterns:

pointblank/data/api-docs.txt

Lines changed: 49 additions & 4 deletions

@@ -9798,7 +9798,7 @@ validation workflows. The `yaml_interrogate()` function can be used to run a val
 YAML strings or files. The `validate_yaml()` function checks if the YAML configuration
 passes its own validity checks.
 
-yaml_interrogate(yaml: 'Union[str, Path]') -> 'Validate'
+yaml_interrogate(yaml: 'Union[str, Path]', set_tbl: 'Union[FrameT, Any, None]' = None) -> 'Validate'
 Execute a YAML-based validation workflow.
 
 This is the main entry point for YAML-based validation workflows. It takes YAML configuration

@@ -9813,13 +9813,20 @@ Execute a YAML-based validation workflow.
 yaml
     YAML configuration as string or file path. Can be: (1) a YAML string containing the
     validation configuration, or (2) a Path object or string path to a YAML file.
+set_tbl
+    An optional table to override the table specified in the YAML configuration. This allows you
+    to apply a YAML-defined validation workflow to a different table than what's specified in
+    the configuration. If provided, this table will replace the table defined in the YAML's
+    `tbl` field before executing the validation workflow. This can be any supported table type
+    including DataFrame objects, Ibis table objects, CSV file paths, Parquet file paths, GitHub
+    URLs, or database connection strings.
 
 Returns
 -------
 Validate
-    An instance of the `Validate` class that has been configured based on the YAML input.
-    This object contains the results of the validation steps defined in the YAML configuration.
-    It includes metadata like table name, label, language, and thresholds if specified.
+    An instance of the `Validate` class that has been configured based on the YAML input. This
+    object contains the results of the validation steps defined in the YAML configuration. It
+    includes metadata like table name, label, language, and thresholds if specified.
 
 Raises
 ------

@@ -9918,6 +9925,44 @@ Execute a YAML-based validation workflow.
 This approach is particularly useful for storing validation configurations as part of your data
 pipeline or version control system, allowing you to maintain validation rules alongside your
 code.
+
+### Using `set_tbl=` to Override the Table
+
+The `set_tbl=` parameter allows you to override the table specified in the YAML configuration.
+This is useful when you have a template validation workflow but want to apply it to different
+tables:
+
+```python
+import polars as pl
+
+# Create a test table with similar structure to small_table
+test_table = pl.DataFrame({
+    "date": ["2023-01-01", "2023-01-02", "2023-01-03"],
+    "a": [1, 2, 3],
+    "b": ["1-abc-123", "2-def-456", "3-ghi-789"],
+    "d": [150, 200, 250]
+})
+
+# Use the same YAML config but apply it to our test table
+yaml_config = '''
+tbl: small_table  # This will be overridden
+tbl_name: Test Table  # This name will be used
+steps:
+- col_exists:
+    columns: [date, a, b, d]
+- col_vals_gt:
+    columns: [d]
+    value: 100
+'''
+
+# Execute with table override
+result = pb.yaml_interrogate(yaml_config, set_tbl=test_table)
+print(f"Validation applied to: {result.tbl_name}")
+result
+```
+
+This feature makes YAML configurations more reusable and flexible, allowing you to define
+validation logic once and apply it to multiple similar tables.
 
 
 validate_yaml(yaml: 'Union[str, Path]') -> 'None'

0 commit comments
