EPIC: Improvements to the CSV processing of upload for batch sends #1490

ccostino · 2024-12-18T16:23:52Z

When a user uploads a CSV for a batch send, it needs to be structured a certain way and use the correct character encoding (ideally this is just straight UTF-8!). However, we don't have total control over what people upload, so we need to make sure we have the proper safeguards in place.

Some of this we've already accounted for, and there are specific checks in place that look for things like BOMs (byte order marks). We recently encountered another edge case that those checks didn't handle, though: the system retried parsing a busted CSV file an order of magnitude more times than there are lines in the file itself, which also generated millions of log entries in quick succession.
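For reference, a BOM check of the kind mentioned above can be sketched as follows. This is a minimal illustration and not the existing Notify.gov code; the function name `decode_upload` and the choice to fall back to strict UTF-8 are assumptions.

```python
import codecs

# Known byte order marks and the codec each one implies.
# UTF-32 is checked before UTF-16 because the UTF-32 LE mark
# begins with the same two bytes as the UTF-16 LE mark.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)


def decode_upload(raw: bytes) -> str:
    """Hypothetical helper: strip a leading BOM and decode to text.

    Raises UnicodeDecodeError if the bytes are not valid in the
    detected encoding (or in UTF-8 when no BOM is present).
    """
    for bom, encoding in _BOMS:
        if raw.startswith(bom):
            return raw[len(bom):].decode(encoding)
    # Assumption: files without a BOM are expected to be UTF-8.
    return raw.decode("utf-8")
```

Because decoding is strict, a file in an unexpected encoding fails once, up front, rather than partway through processing.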

We need to rethink some of our approach to our CSV processing and come up with a better way for handling user input of this nature. We also need to make sure that the system handles failures and exceptions appropriately when it comes to file processing. Finally, we need to make sure we have clear, plain language instructions for users on the site on what to do and how to properly upload these CSVs (spreadsheets).

First Step

The first step with this epic is to write up an ADR that evaluates our current architecture and approach to CSV processing and proposes one or more alternatives that account for the following:

Is able to sanitize and structure the input file to a known, valid structure that we want to operate against, regardless of what the original source looked like.
If it can't do the sanitization or restructuring, then we error out immediately and provide useful error messages and feedback to the user so they know what to do and how to fix the problem.
With the sanitized input file, parse the batch and hand things off; if at any point there's a failure that we can't recover from, we need to make sure the application handles that gracefully, terminates the job immediately and fully, and provides useful error feedback and information to the user. We do not want to retry the job again and log thousands or millions of errors!
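As a starting point for the ADR discussion, the three requirements above could look roughly like this. It is only a sketch under stated assumptions: `CsvValidationError`, `parse_batch`, and the required `phone number` column are hypothetical names, not existing code, and the real validation rules would come out of the ADR.

```python
import csv
import io


class CsvValidationError(ValueError):
    """Hypothetical error type: the upload is unusable and must not be retried."""


# Assumption: a single required column, named for illustration only.
REQUIRED_COLUMN = "phone number"


def parse_batch(text: str) -> list[dict[str, str]]:
    """Normalize an already-decoded CSV into a list of row dicts, or fail once."""
    # strict=True makes the csv module raise csv.Error on malformed input
    # instead of guessing at what the row was meant to be.
    reader = csv.reader(io.StringIO(text, newline=""), strict=True)
    rows: list[dict[str, str]] = []
    try:
        header = next(reader, None)
        if not header:
            raise CsvValidationError("The file is empty or has no header row.")
        # Normalize header names to a known form.
        columns = [name.strip().lower() for name in header]
        if REQUIRED_COLUMN not in columns:
            raise CsvValidationError(
                f"The file needs a column named '{REQUIRED_COLUMN}'."
            )
        for row in reader:
            if not any(cell.strip() for cell in row):
                continue  # skip blank lines
            if len(row) != len(columns):
                raise CsvValidationError(
                    f"Line {reader.line_num} has {len(row)} columns; "
                    f"expected {len(columns)}."
                )
            rows.append(dict(zip(columns, (cell.strip() for cell in row))))
    except csv.Error as exc:
        # Convert parser failures into the same terminal, user-facing error.
        raise CsvValidationError(
            f"Line {reader.line_num} could not be read: {exc}"
        ) from exc
    return rows
```

The design choice here is that every failure surfaces as one exception type carrying a plain-language message. The job runner can then treat that type as terminal: mark the job failed, show the message to the user, and log a single entry instead of retrying.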

Let's also make sure we're really thinking through this: there's the Python CSV documentation itself, the Python Cookbook and Fluent Python O'Reilly books, articles on how to work with CSV files on Geeks for Geeks and Real Python, and any number of other good authoritative sources of info on how to approach this! 🙂

Next Step

Once we have an ADR written, we'll discuss it as a team and figure out what makes sense to do moving forward, then update this epic with links to new user stories for the actual implementation and testing of the work.


ccostino added engineering epic labels Dec 18, 2024

ccostino added this to Notify.gov product board Dec 18, 2024

github-project-automation bot moved this to Issue Backlog (3 Months or Further Out) in Notify.gov product board Dec 18, 2024

ccostino moved this from Issue Backlog (3 Months or Further Out) to Epics (Less than 3 Months) in Notify.gov product board Dec 18, 2024
