Skip to content

Conversation

@sitammeur
Copy link
Contributor

@sitammeur sitammeur commented Feb 5, 2026

Summary by CodeRabbit

  • New Features
    • Interactive web UI to run end-to-end form-filling workflows with live logs and PDF preview/download
    • Document scanning (OCR), intelligent data transformation, and PDF form population
    • CLI/programmatic entrypoint to run the workflow
    • Built-in W-9 form schema and schema listing/loading support
    • Tools for OCR scanning and form-filling using external OCR/form services
    • Example configuration (.env) and project metadata plus detailed README documentation

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Feb 5, 2026

📝 Walkthrough

Walkthrough

Adds a new form-filling automation system: typed models, Datalab-based document scanning and PDF filling tools, a multi-stage CrewAI flow (scan → transform → fill), a Streamlit UI with event logging, project metadata, and a W-9 schema.

Changes

Cohort / File(s) Summary
Project config & docs
form-filling-automation/.env.example, form-filling-automation/pyproject.toml, form-filling-automation/README.md
Added environment example, project metadata and dependencies, and a detailed README describing setup and usage.
Data models & schemas
form-filling-automation/models.py, form-filling-automation/schemas/w9.yaml
New Pydantic models for form fields, schema, normalized form data, and flow state; loader/list helpers; added W-9 YAML schema.
Tools (Datalab integration)
form-filling-automation/tools.py
Added DocumentScanTool and FormFillTool with input models, Datalab client calls, validation, file handling, and error reporting.
Workflow & orchestration
form-filling-automation/flow.py, form-filling-automation/main.py
Implemented FormFillingFlow (initialize → extract → transform → fill → finalize), agent/task factories, JSON extraction helpers, and run_form_flow entrypoint (with alias).
Streamlit UI & helpers
form-filling-automation/app.py
New Streamlit app with StreamlitLogListener, PDF preview, JSON extraction helper, sidebar for schema/upload, stepwise workflow orchestration and real-time event display.

Sequence Diagram

sequenceDiagram
    participant User
    participant UI as Streamlit UI
    participant Flow as FormFillingFlow
    participant ScanAgent as Scanner Agent
    participant TransformAgent as Transformer Agent
    participant FillAgent as Filler Agent
    participant Datalab as Datalab SDK
    participant FS as Filesystem

    User->>UI: upload document, blank form, select schema
    UI->>Flow: kickoff(document, blank_form, schema)
    Flow->>ScanAgent: run scan task
    ScanAgent->>Datalab: request OCR/convert
    Datalab-->>ScanAgent: extracted_text
    ScanAgent-->>Flow: extracted_text

    Flow->>TransformAgent: run transform task (text + schema)
    TransformAgent->>TransformAgent: map to structured form_data
    TransformAgent-->>Flow: form_data (JSON)

    Flow->>FillAgent: run fill task (form_data + blank_form)
    FillAgent->>Datalab: request PDF form fill
    Datalab->>FS: write filled PDF
    Datalab-->>FillAgent: success + path
    FillAgent-->>Flow: completed_form_path

    Flow-->>UI: result (completed_form_path, form_data)
    UI-->>User: display preview + download
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰 I hopped through code, nibbling tasks with glee,
Scan, transform, and fill — a trifecta spree,
PDFs glow, fields lined up in a row,
Streamlit shows progress as breezes blow,
Cheers from a rabbit — forms done, ready to go! 🎉

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 75.61% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'Add Form Automation Agentic Workflow' clearly and accurately summarizes the main change: introducing a complete form-filling automation system built with CrewAI agents.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 8

🤖 Fix all issues with AI agents
In `@form-filling-automation/app.py`:
- Around line 3-7: The code currently uses the relative string "schemas" which
breaks when the CWD is different; change to resolve the schemas directory
relative to app.py by computing base = os.path.dirname(__file__) and schema_dir
= os.path.join(base, "schemas") (or use Path(__file__).resolve().parent /
"schemas"), then replace any direct "schemas" usages (including the
discovery/loading calls around the code that currently references "schemas" near
the bottom of app.py) to use schema_dir so schema discovery works regardless of
working directory.
- Around line 324-329: The check currently treats any non-empty result.raw as
valid text (in crew.kickoff -> result.raw -> extracted_text), allowing OCR error
strings like "Error: ..." to pass; update the validation to first detect tool
errors by checking for an explicit error field on result (e.g., result.error or
result.status) and, if not available, reject raw strings that match common error
patterns (e.g., start with "Error", "ERR", "Traceback", or contain "exception")
before assigning extracted_text; if an error is detected, raise a ValueError
with the tool error details instead of proceeding.
- Around line 288-295: The code currently builds file paths using
user-controlled filenames (identity_file.name and blank_form_file.name) causing
path traversal risk; update the logic that constructs identity_path and
blank_form_path to use pathlib.Path(identity_file.name).name and
pathlib.Path(blank_form_file.name).name to strip any directory components before
joining with tmp_dir, and add the necessary pathlib import; reference the
variables identity_path, identity_file, blank_form_path, blank_form_file and
tmp_dir when making this change.

In `@form-filling-automation/flow.py`:
- Around line 25-28: The LLM instantiation uses
os.environ.get("OPENROUTER_API_KEY") which returns None when the environment
variable is missing; update the flow.py code around the llm = LLM(...) creation
to validate the OPENROUTER_API_KEY first (either use
os.environ["OPENROUTER_API_KEY"] to let Python raise a clear KeyError or
explicitly check and raise a descriptive error), then pass the validated
non-None key into LLM(...) so runtime failures are deterministic and the error
message clearly identifies the missing environment variable.
- Around line 179-182: The default output path logic in the block that sets
self.state.output_path currently uses string replace on
self.state.blank_form_path which fails for uppercase or missing extensions;
replace this with pathlib.Path handling: import Path, build p =
Path(self.state.blank_form_path), compute new_name = p.stem + "_completed" +
(p.suffix if p.suffix.lower() else ".pdf"), then set self.state.output_path =
str(p.with_name(new_name)); update the code in the same conditional that
references self.state.output_path and self.state.blank_form_path to use these
Path-based symbols.

In `@form-filling-automation/main.py`:
- Around line 3-6: The code coerces a default None output_path into an empty
string and passes that into FormFillingFlow/load_schema calls (see output_path
variable and FormFillingFlow usage around lines you modified), which can produce
invalid file writes; change the call sites so you either leave output_path as
None or compute a deterministic default path (e.g., derived from the loaded
schema name) and only pass a non-empty string when valid—i.e., avoid passing ""
by: if output_path is falsy keep None (do not pass output_path) or replace it
with a clearly derived default before invoking FormFillingFlow or downstream
functions that consume output_path.

In `@form-filling-automation/schemas/w9.yaml`:
- Around line 13-83: The schema currently makes all tax classification
checkboxes (individual_sole_proprietor, c_corporation, s_corporation,
partnership, trust_estate, llc, other) and TIN parts (ssn_1, ssn_2, ssn_3,
ein_1, ein_2) optional, allowing invalid W‑9s; update w9.yaml to require that at
least one classification is present (add an anyOf/oneOf validation referencing
those checkbox fields) and enforce a complete TIN by requiring either all SSN
parts (ssn_1, ssn_2, ssn_3) or all EIN parts (ein_1, ein_2) via an anyOf rule,
or implement equivalent workflow-level validation that checks these same
conditions before form generation.

In `@form-filling-automation/tools.py`:
- Around line 77-96: In _run, add validation for the output_path parameter
before any writes: check that output_path is provided (not None or empty
string), resolve it to a Path, ensure its parent directory exists and is
writable (or create the parent dir if intended), and return a clear error like
"Error: Invalid output_path" if the check fails; update the existing validation
block in the _run method (which already checks form_path, field_data, and
api_key) to perform this output_path validation immediately after those checks
using the output_path variable.
🧹 Nitpick comments (4)
form-filling-automation/models.py (2)

100-110: Non-dict field values are silently dropped.

In from_dict, if field_info is not a dict, it is silently ignored. This could mask malformed input data. Consider logging a warning or raising a validation error for unexpected data structures to aid debugging.

💡 Suggested improvement
     `@classmethod`
     def from_dict(cls, data: Dict[str, Any]) -> "GenericFormData":
         """Create from a dictionary, converting non-string values to strings."""
         normalized = {}
         for field_name, field_info in data.items():
             if isinstance(field_info, dict):
                 normalized[field_name] = {
                     "value": cls._to_string(field_info.get("value", "")),
                     "description": cls._to_string(field_info.get("description", "")),
                 }
+            else:
+                # Handle simple value format (just a value, no description)
+                normalized[field_name] = {
+                    "value": cls._to_string(field_info),
+                    "description": "",
+                }
         return cls(field_data=normalized)

177-203: Consolidate duplicate loops and improve error visibility.

The two near-identical loops for .yaml and .yml files can be combined. Also, the bare except Exception: continue silently swallows all errors, making it difficult to diagnose schema loading issues.

♻️ Proposed refactor
-    for file in schemas_path.glob("*.yaml"):
-        try:
-            schema = FormSchema.from_yaml(str(file))
-            schemas.append(
-                {
-                    "file": file.name,
-                    "path": str(file),
-                    "form_type": schema.form_type,
-                    "form_name": schema.form_name,
-                }
-            )
-        except Exception:
-            continue
-
-    for file in schemas_path.glob("*.yml"):
-        try:
-            schema = FormSchema.from_yaml(str(file))
-            schemas.append(
-                {
-                    "file": file.name,
-                    "path": str(file),
-                    "form_type": schema.form_type,
-                    "form_name": schema.form_name,
-                }
-            )
-        except Exception:
-            continue
+    import itertools
+    import logging
+    
+    logger = logging.getLogger(__name__)
+    
+    yaml_files = itertools.chain(
+        schemas_path.glob("*.yaml"),
+        schemas_path.glob("*.yml"),
+    )
+    for file in yaml_files:
+        try:
+            schema = FormSchema.from_yaml(str(file))
+            schemas.append(
+                {
+                    "file": file.name,
+                    "path": str(file),
+                    "form_type": schema.form_type,
+                    "form_name": schema.form_name,
+                }
+            )
+        except Exception as e:
+            logger.warning(f"Failed to load schema from {file}: {e}")
+            continue
form-filling-automation/flow.py (2)

224-226: Consider defensive null check for form_schema.

While initialize() validates form_schema, accessing it directly without a null check here could cause an AttributeError if the flow state is manipulated unexpectedly. A defensive check would improve robustness.

💡 Suggested improvement
-        schema = self.state.form_schema
+        schema = self.state.form_schema
+        if not schema:
+            self.state.errors.append("Form schema not available")
+            return "failed"
         transformer_agent = create_form_transformer_agent(schema)

340-349: Greedy regex may capture too much text.

The pattern \{[\s\S]*\} is greedy and matches from the first { to the last } in the entire text. If the text contains multiple JSON objects or trailing content, this could capture invalid JSON. The json.loads call will reject it, but using a non-greedy pattern \{[\s\S]*?\} might improve first-match accuracy.

💡 Suggested improvement
         # Try to find raw JSON object
-        brace_pattern = r"\{[\s\S]*\}"
+        brace_pattern = r"\{[\s\S]*?\}"
         matches = re.findall(brace_pattern, text)

Comment on lines +3 to +7
import base64
from datetime import datetime
import json
import os
import tempfile
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Resolve the schema directory relative to app.py to avoid missing schemas.

Using "schemas" relies on the current working directory. If the app is launched from the repo root (a common streamlit run form-filling-automation/app.py pattern), schema discovery fails and the UI cannot proceed.

🔧 Suggested fix
 import base64
 from datetime import datetime
 import json
 import os
 import tempfile
+from pathlib import Path
...
-    available_schemas = list_available_schemas("schemas")
+    schemas_dir = Path(__file__).resolve().parent / "schemas"
+    available_schemas = list_available_schemas(str(schemas_dir))

Also applies to: 190-191

🤖 Prompt for AI Agents
In `@form-filling-automation/app.py` around lines 3 - 7, The code currently uses
the relative string "schemas" which breaks when the CWD is different; change to
resolve the schemas directory relative to app.py by computing base =
os.path.dirname(__file__) and schema_dir = os.path.join(base, "schemas") (or use
Path(__file__).resolve().parent / "schemas"), then replace any direct "schemas"
usages (including the discovery/loading calls around the code that currently
references "schemas" near the bottom of app.py) to use schema_dir so schema
discovery works regardless of working directory.

Comment on lines +288 to +295
identity_path = os.path.join(tmp_dir, identity_file.name)
with open(identity_path, "wb") as f:
f.write(identity_file.getvalue())

blank_form_path = os.path.join(tmp_dir, blank_form_file.name)
with open(blank_form_path, "wb") as f:
f.write(blank_form_file.getvalue())

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Sanitize uploaded filenames to prevent path traversal.

Uploaded file names are user‑controlled. Joining them directly into a temp path allows path traversal (e.g., ../../...) and writes outside the temp directory. Use Path(...).name to strip directory segments. (Requires pathlib.Path import.)

🔒 Suggested fix
-identity_path = os.path.join(tmp_dir, identity_file.name)
+identity_name = Path(identity_file.name).name
+identity_path = os.path.join(tmp_dir, identity_name)
...
-blank_form_path = os.path.join(tmp_dir, blank_form_file.name)
+blank_form_name = Path(blank_form_file.name).name
+blank_form_path = os.path.join(tmp_dir, blank_form_name)
🤖 Prompt for AI Agents
In `@form-filling-automation/app.py` around lines 288 - 295, The code currently
builds file paths using user-controlled filenames (identity_file.name and
blank_form_file.name) causing path traversal risk; update the logic that
constructs identity_path and blank_form_path to use
pathlib.Path(identity_file.name).name and
pathlib.Path(blank_form_file.name).name to strip any directory components before
joining with tmp_dir, and add the necessary pathlib import; reference the
variables identity_path, identity_file, blank_form_path, blank_form_file and
tmp_dir when making this change.

Comment on lines 324 to 329
result = crew.kickoff(inputs={"document_file": identity_path})

extracted_text = result.raw if result.raw else ""
if not extracted_text:
raise ValueError("No text extracted from document")

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Detect tool error strings instead of treating them as extracted text.

The OCR tool returns error messages as strings. The current check only verifies non‑empty output, so "Error: ..." passes and the workflow continues as if successful.

🧰 Suggested fix
-            extracted_text = result.raw if result.raw else ""
-            if not extracted_text:
-                raise ValueError("No text extracted from document")
+            extracted_text = result.raw if result.raw else ""
+            if not extracted_text:
+                raise ValueError("No text extracted from document")
+            if extracted_text.strip().startswith("Error:"):
+                raise ValueError(extracted_text)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
result = crew.kickoff(inputs={"document_file": identity_path})
extracted_text = result.raw if result.raw else ""
if not extracted_text:
raise ValueError("No text extracted from document")
result = crew.kickoff(inputs={"document_file": identity_path})
extracted_text = result.raw if result.raw else ""
if not extracted_text:
raise ValueError("No text extracted from document")
if extracted_text.strip().startswith("Error:"):
raise ValueError(extracted_text)
🤖 Prompt for AI Agents
In `@form-filling-automation/app.py` around lines 324 - 329, The check currently
treats any non-empty result.raw as valid text (in crew.kickoff -> result.raw ->
extracted_text), allowing OCR error strings like "Error: ..." to pass; update
the validation to first detect tool errors by checking for an explicit error
field on result (e.g., result.error or result.status) and, if not available,
reject raw strings that match common error patterns (e.g., start with "Error",
"ERR", "Traceback", or contain "exception") before assigning extracted_text; if
an error is detected, raise a ValueError with the tool error details instead of
proceeding.

Comment on lines 25 to 28
llm = LLM(
model="openrouter/minimax/minimax-m2.1",
api_key=os.environ.get("OPENROUTER_API_KEY"),
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Missing API key will cause runtime failures.

os.environ.get("OPENROUTER_API_KEY") returns None if the environment variable is not set. This could lead to cryptic errors when the LLM is invoked. Consider validating or using os.environ["OPENROUTER_API_KEY"] to fail fast with a clear KeyError.

🛡️ Proposed fix for early validation
+OPENROUTER_API_KEY = os.environ.get("OPENROUTER_API_KEY")
+if not OPENROUTER_API_KEY:
+    raise EnvironmentError("OPENROUTER_API_KEY environment variable is required")
+
 llm = LLM(
     model="openrouter/minimax/minimax-m2.1",
-    api_key=os.environ.get("OPENROUTER_API_KEY"),
+    api_key=OPENROUTER_API_KEY,
 )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
llm = LLM(
model="openrouter/minimax/minimax-m2.1",
api_key=os.environ.get("OPENROUTER_API_KEY"),
)
OPENROUTER_API_KEY = os.environ.get("OPENROUTER_API_KEY")
if not OPENROUTER_API_KEY:
raise EnvironmentError("OPENROUTER_API_KEY environment variable is required")
llm = LLM(
model="openrouter/minimax/minimax-m2.1",
api_key=OPENROUTER_API_KEY,
)
🤖 Prompt for AI Agents
In `@form-filling-automation/flow.py` around lines 25 - 28, The LLM instantiation
uses os.environ.get("OPENROUTER_API_KEY") which returns None when the
environment variable is missing; update the flow.py code around the llm =
LLM(...) creation to validate the OPENROUTER_API_KEY first (either use
os.environ["OPENROUTER_API_KEY"] to let Python raise a clear KeyError or
explicitly check and raise a descriptive error), then pass the validated
non-None key into LLM(...) so runtime failures are deterministic and the error
message clearly identifies the missing environment variable.

Comment on lines +179 to +182
if not self.state.output_path:
self.state.output_path = self.state.blank_form_path.replace(
".pdf", "_completed.pdf"
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Default output path generation may fail for non-.pdf files.

The .replace(".pdf", "_completed.pdf") approach won't modify paths that don't contain .pdf (e.g., uppercase .PDF or missing extension), resulting in the output potentially overwriting the input. Consider using Path for safer path manipulation.

🛡️ Proposed fix using Path
         if not self.state.output_path:
-            self.state.output_path = self.state.blank_form_path.replace(
-                ".pdf", "_completed.pdf"
-            )
+            from pathlib import Path
+            blank_path = Path(self.state.blank_form_path)
+            self.state.output_path = str(
+                blank_path.with_stem(blank_path.stem + "_completed")
+            )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if not self.state.output_path:
self.state.output_path = self.state.blank_form_path.replace(
".pdf", "_completed.pdf"
)
if not self.state.output_path:
from pathlib import Path
blank_path = Path(self.state.blank_form_path)
self.state.output_path = str(
blank_path.with_stem(blank_path.stem + "_completed")
)
🤖 Prompt for AI Agents
In `@form-filling-automation/flow.py` around lines 179 - 182, The default output
path logic in the block that sets self.state.output_path currently uses string
replace on self.state.blank_form_path which fails for uppercase or missing
extensions; replace this with pathlib.Path handling: import Path, build p =
Path(self.state.blank_form_path), compute new_name = p.stem + "_completed" +
(p.suffix if p.suffix.lower() else ".pdf"), then set self.state.output_path =
str(p.with_name(new_name)); update the code in the same conditional that
references self.state.output_path and self.state.blank_form_path to use these
Path-based symbols.

Comment on lines +3 to +6
from typing import Any, Dict, Optional

from flow import FormFillingFlow
from models import load_schema
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Avoid passing an empty output_path when the optional argument is omitted.

output_path defaults to None but is coerced to "", which is not a valid file path and can cause downstream fill to fail or write in unexpected locations. Prefer a deterministic default or keep None and let downstream handle it—just don’t pass an empty string.

✅ Suggested fix (derive a default path)
-from typing import Any, Dict, Optional
+from pathlib import Path
+from typing import Any, Dict, Optional
...
-    flow.state.output_path = output_path or ""
+    if output_path is None:
+        output_path = str(
+            Path(blank_form_path).with_name(f"{schema.form_type}_completed.pdf")
+        )
+    flow.state.output_path = output_path

Also applies to: 37-41

🤖 Prompt for AI Agents
In `@form-filling-automation/main.py` around lines 3 - 6, The code coerces a
default None output_path into an empty string and passes that into
FormFillingFlow/load_schema calls (see output_path variable and FormFillingFlow
usage around lines you modified), which can produce invalid file writes; change
the call sites so you either leave output_path as None or compute a
deterministic default path (e.g., derived from the loaded schema name) and only
pass a non-empty string when valid—i.e., avoid passing "" by: if output_path is
falsy keep None (do not pass output_path) or replace it with a clearly derived
default before invoking FormFillingFlow or downstream functions that consume
output_path.

Comment on lines +13 to +83
- name: individual_sole_proprietor
description: "Individual/sole proprietor or single-member LLC checkbox for federal tax classification (Line 3a)"
required: false

- name: c_corporation
description: "C Corporation checkbox for federal tax classification (Line 3b)"
required: false

- name: s_corporation
description: "S Corporation checkbox for federal tax classification (Line 3c)"
required: false

- name: partnership
description: "Partnership checkbox for federal tax classification (Line 3d)"
required: false

- name: trust_estate
description: "Trust/estate checkbox for federal tax classification (Line 3e)"
required: false

- name: llc
description: "LLC checkbox with tax classification code (Line 3f)"
required: false

- name: other
description: "Other federal tax classification (Line 3g)"
required: false

- name: exempt_payee_code
description: "Exempt payee code if applicable (Line 4)"
required: false

- name: fatca_exemption_code
description: "Exemption from FATCA reporting code if applicable (Line 4)"
required: false

- name: address
description: "Address (number, street, and apt. or suite no.) (Line 5)"
required: true

- name: city_state_zip
description: "City, state, and ZIP code (Line 6)"
required: true

- name: requester_name_address
description: "Requester's name and address (optional) (Line 7)"
required: false

- name: account_numbers
description: "List account number(s) here (optional) (Line 8)"
required: false

- name: ssn_1
description: "Social security number first 3 digits (Part I - box 1)"
required: false

- name: ssn_2
description: "Social security number middle 2 digits (Part I - box 2)"
required: false

- name: ssn_3
description: "Social security number last 4 digits (Part I - box 3)"
required: false

- name: ein_1
description: "Employer identification number first 2 digits (Part I - box 1)"
required: false

- name: ein_2
description: "Employer identification number last 7 digits (Part I - box 2)"
required: false
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Require tax classification and TIN presence to avoid invalid W‑9 outputs.

W‑9 forms require a tax classification and a TIN (SSN or EIN). All classification checkboxes and SSN/EIN parts are optional here, so the workflow can complete without mandatory data and generate an invalid form. Add schema constraints or workflow validation to enforce at least one classification and a complete SSN/EIN before filling.

🤖 Prompt for AI Agents
In `@form-filling-automation/schemas/w9.yaml` around lines 13 - 83, The schema
currently makes all tax classification checkboxes (individual_sole_proprietor,
c_corporation, s_corporation, partnership, trust_estate, llc, other) and TIN
parts (ssn_1, ssn_2, ssn_3, ein_1, ein_2) optional, allowing invalid W‑9s;
update w9.yaml to require that at least one classification is present (add an
anyOf/oneOf validation referencing those checkbox fields) and enforce a complete
TIN by requiring either all SSN parts (ssn_1, ssn_2, ssn_3) or all EIN parts
(ein_1, ein_2) via an anyOf rule, or implement equivalent workflow-level
validation that checks these same conditions before form generation.

Comment on lines +77 to +96
def _run(
self,
form_path: str,
field_data: Dict[str, Any],
output_path: str,
context: str = "Generic form",
) -> str:
# Validate form file exists
if not Path(form_path).exists():
return f"Error: Form file not found: {form_path}"

# Validate field_data is not empty
if not field_data:
return "Error: No field data provided to fill the form"

# Validate API key
api_key = os.environ.get("DATALAB_API_KEY")
if not api_key:
return "Error: DATALAB_API_KEY environment variable not set"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Validate output_path before attempting to save.

output_path can be empty (e.g., when callers pass None upstream). This results in invalid or unpredictable writes. Guard early with a clear error.

✅ Suggested fix
     def _run(
         self,
         form_path: str,
         field_data: Dict[str, Any],
         output_path: str,
         context: str = "Generic form",
     ) -> str:
+        # Validate output_path
+        if not output_path or not str(output_path).strip():
+            return "Error: output_path must be provided and non-empty"
         # Validate form file exists
         if not Path(form_path).exists():
             return f"Error: Form file not found: {form_path}"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def _run(
self,
form_path: str,
field_data: Dict[str, Any],
output_path: str,
context: str = "Generic form",
) -> str:
# Validate form file exists
if not Path(form_path).exists():
return f"Error: Form file not found: {form_path}"
# Validate field_data is not empty
if not field_data:
return "Error: No field data provided to fill the form"
# Validate API key
api_key = os.environ.get("DATALAB_API_KEY")
if not api_key:
return "Error: DATALAB_API_KEY environment variable not set"
def _run(
self,
form_path: str,
field_data: Dict[str, Any],
output_path: str,
context: str = "Generic form",
) -> str:
# Validate output_path
if not output_path or not str(output_path).strip():
return "Error: output_path must be provided and non-empty"
# Validate form file exists
if not Path(form_path).exists():
return f"Error: Form file not found: {form_path}"
# Validate field_data is not empty
if not field_data:
return "Error: No field data provided to fill the form"
# Validate API key
api_key = os.environ.get("DATALAB_API_KEY")
if not api_key:
return "Error: DATALAB_API_KEY environment variable not set"
🤖 Prompt for AI Agents
In `@form-filling-automation/tools.py` around lines 77 - 96, In _run, add
validation for the output_path parameter before any writes: check that
output_path is provided (not None or empty string), resolve it to a Path, ensure
its parent directory exists and is writable (or create the parent dir if
intended), and return a clear error like "Error: Invalid output_path" if the
check fails; update the existing validation block in the _run method (which
already checks form_path, field_data, and api_key) to perform this output_path
validation immediately after those checks using the output_path variable.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@form-filling-automation/app.py`:
- Around line 408-414: The call to crew.kickoff currently discards its return
value so any error string from FormFillTool is ignored; change the code around
crew.kickoff(inputs={...}) to capture the result (e.g., result =
crew.kickoff(...)), then validate that result indicates success (check for an
error string or falsy/exception payload) and handle failures by
logging/raising/returning an error to the UI instead of proceeding as success;
update any surrounding logic that assumes success (references: crew.kickoff,
FormFillTool, form_data, blank_form_path, output_path) to short-circuit on error
and propagate a clear error message to the caller.

In `@form-filling-automation/flow.py`:
- Around line 279-291: The result of crew.kickoff (the FormFillTool invocation)
is currently ignored, so capture its return value (e.g., result =
crew.kickoff(...)) and validate it: if the tool returns an error string (like
starting with "Error:" or otherwise indicating failure), set self.state.status
to "failed" (or appropriate error state), do not set
self.state.completed_form_path, and return or raise the error message; only on a
successful result should you set self.state.completed_form_path =
self.state.output_path, set self.state.status = "completed", and return "ok".
Ensure you reference and update the existing symbols crew.kickoff,
self.state.completed_form_path, self.state.status, and self.state.output_path
when implementing this check.
🧹 Nitpick comments (3)
form-filling-automation/flow.py (1)

299-313: Consider using structured logging instead of print statements.

The debug output uses print() which doesn't integrate with logging frameworks. For production use, consider logging.info() or similar.

form-filling-automation/app.py (2)

155-183: Duplicated JSON extraction logic.

This function is nearly identical to FormFillingFlow._extract_json in flow.py. Consider extracting to a shared utility module to avoid duplication and ensure consistent behavior.

♻️ Suggested approach

Create a shared utility, e.g., utils.py:

# utils.py
import json
import re

def extract_json_from_text(text: str) -> dict:
    """Extract JSON from text that may contain markdown code blocks."""
    # ... shared implementation

Then import in both flow.py and app.py:

from utils import extract_json_from_text

286-286: Temporary directory is never cleaned up.

tempfile.mkdtemp() creates a directory that persists after the script ends. Consider using a context manager or registering cleanup for long-running deployments.

♻️ Alternative using TemporaryDirectory context manager
# If restructuring is feasible:
with tempfile.TemporaryDirectory(prefix="form_agent_") as tmp_dir:
    # ... workflow code ...

Or register cleanup with atexit if context manager doesn't fit the Streamlit flow.

Comment on lines +408 to +414
crew.kickoff(
inputs={
"form_data": json.dumps(form_data.to_tool_format(), indent=2),
"blank_form_path": blank_form_path,
"output_path": output_path,
}
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Missing result validation after form filling.

Similar to the issue in flow.py, the crew.kickoff result is discarded. If FormFillTool returns an error string, the UI will incorrectly report success.

🐛 Proposed fix to validate the fill result
-                crew.kickoff(
+                result = crew.kickoff(
                     inputs={
                         "form_data": json.dumps(form_data.to_tool_format(), indent=2),
                         "blank_form_path": blank_form_path,
                         "output_path": output_path,
                     }
                 )
+
+                # Check for tool errors
+                raw_result = result.raw if result.raw else ""
+                if raw_result.startswith("Error:"):
+                    raise ValueError(raw_result)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
crew.kickoff(
inputs={
"form_data": json.dumps(form_data.to_tool_format(), indent=2),
"blank_form_path": blank_form_path,
"output_path": output_path,
}
)
result = crew.kickoff(
inputs={
"form_data": json.dumps(form_data.to_tool_format(), indent=2),
"blank_form_path": blank_form_path,
"output_path": output_path,
}
)
# Check for tool errors
raw_result = result.raw if result.raw else ""
if raw_result.startswith("Error:"):
raise ValueError(raw_result)
🤖 Prompt for AI Agents
In `@form-filling-automation/app.py` around lines 408 - 414, The call to
crew.kickoff currently discards its return value so any error string from
FormFillTool is ignored; change the code around crew.kickoff(inputs={...}) to
capture the result (e.g., result = crew.kickoff(...)), then validate that result
indicates success (check for an error string or falsy/exception payload) and
handle failures by logging/raising/returning an error to the UI instead of
proceeding as success; update any surrounding logic that assumes success
(references: crew.kickoff, FormFillTool, form_data, blank_form_path,
output_path) to short-circuit on error and propagate a clear error message to
the caller.

Comment on lines +279 to +291
crew.kickoff(
inputs={
"form_data": json.dumps(
self.state.form_data.to_tool_format(), indent=2
),
"blank_form_path": self.state.blank_form_path,
"output_path": self.state.output_path,
}
)

self.state.completed_form_path = self.state.output_path
self.state.status = "completed"
return "ok"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Missing result validation after form filling.

The crew.kickoff result is discarded, so tool errors (returned as "Error: ..." strings by FormFillTool) go undetected. The flow will report success even if form filling failed.

🐛 Proposed fix to validate the fill result
-        crew.kickoff(
+        result = crew.kickoff(
             inputs={
                 "form_data": json.dumps(
                     self.state.form_data.to_tool_format(), indent=2
                 ),
                 "blank_form_path": self.state.blank_form_path,
                 "output_path": self.state.output_path,
             }
         )
 
+        # Check for tool errors in the result
+        raw_result = result.raw if result.raw else ""
+        if raw_result.startswith("Error:"):
+            self.state.errors.append(raw_result)
+            return "failed"
+
         self.state.completed_form_path = self.state.output_path
         self.state.status = "completed"
         return "ok"
🤖 Prompt for AI Agents
In `@form-filling-automation/flow.py` around lines 279 - 291, The result of
crew.kickoff (the FormFillTool invocation) is currently ignored, so capture its
return value (e.g., result = crew.kickoff(...)) and validate it: if the tool
returns an error string (like starting with "Error:" or otherwise indicating
failure), set self.state.status to "failed" (or appropriate error state), do not
set self.state.completed_form_path, and return or raise the error message; only
on a successful result should you set self.state.completed_form_path =
self.state.output_path, set self.state.status = "completed", and return "ok".
Ensure you reference and update the existing symbols crew.kickoff,
self.state.completed_form_path, self.state.status, and self.state.output_path
when implementing this check.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant