EPIC: Improvements to the CSV processing of upload for batch sends #1490

Open
ccostino opened this issue Dec 18, 2024 · 0 comments

ccostino commented Dec 18, 2024

When a user uploads a CSV for a batch send, it needs to be structured a certain way and use the correct character encoding (ideally this is just straight UTF-8!). However, we don't have total control over what people upload, so we need to make sure we have the proper safeguards in place.

Some of this we've already accounted for, and there are specific checks in place that look for things like BOMs (byte order marks). We recently encountered another edge case that these checks didn't handle, though: the system retried parsing a busted CSV file an order of magnitude more times than there are lines in the file, which also generated millions of log entries in quick succession.
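For reference, a BOM check of the kind mentioned above can be sketched in a few lines of Python. This is an illustrative sketch only, not the check that currently exists in the codebase; the helper names `detect_bom_encoding` and `decode_upload` are hypothetical, and falling back to UTF-8 when no BOM is present is an assumption.

```python
import codecs

# Hypothetical lookup table: leading byte order mark -> codec that consumes it.
# UTF-32 must be checked before UTF-16 because the UTF-32 LE BOM starts with
# the same two bytes as the UTF-16 LE BOM.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
)


def detect_bom_encoding(raw: bytes) -> str:
    """Return a codec name based on the BOM, or 'utf-8' if there is none."""
    for bom, encoding in _BOMS:
        if raw.startswith(bom):
            return encoding
    return "utf-8"  # assumption: files without a BOM are expected to be UTF-8


def decode_upload(raw: bytes) -> str:
    """Decode uploaded bytes to text; the chosen codec strips the BOM."""
    return raw.decode(detect_bom_encoding(raw))
```

A file that is neither BOM-marked nor valid UTF-8 makes `decode_upload` raise `UnicodeDecodeError`, which the caller would still need to turn into a single, clear failure.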

We need to rethink some of our approach to our CSV processing and come up with a better way for handling user input of this nature. We also need to make sure that the system handles failures and exceptions appropriately when it comes to file processing. Finally, we need to make sure we have clear, plain language instructions for users on the site on what to do and how to properly upload these CSVs (spreadsheets).

First Step

The first step with this epic is to write up an ADR that evaluates our current architecture and approach to CSV processing and proposes one or more alternatives that account for the following:

  • Is able to sanitize and structure the input file to a known, valid structure that we want to operate against, regardless of what the original source looked like.
  • If it can't do the sanitization or restructuring, then we error out immediately and provide useful error messages and feedback to the user so they know what to do and how to fix the problem.
  • With the sanitized input file, parse the batch and hand things off; if at any point there's a failure that we can't recover from, we need to make sure the application handles that gracefully, terminates the job immediately and fully, and provides useful error feedback and information to the user. We do not want to retry the job again and log thousands or millions of errors!
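To make the three requirements above concrete for the ADR discussion, here is a minimal sketch of one possible shape, using only the standard library `csv` module. It is not a proposal for the final design: `CsvUploadError`, `sanitize_csv`, `run_batch`, the `send` callback, and the returned status dictionaries are all hypothetical names, and lowercasing headers and dropping blank rows are assumed normalization rules.

```python
import csv
import io
import logging

logger = logging.getLogger(__name__)


class CsvUploadError(ValueError):
    """One user-facing error for an upload that cannot be processed."""


def sanitize_csv(raw: bytes, required_columns: set) -> list:
    """Normalize an upload into a list of dicts, or fail immediately."""
    try:
        text = raw.decode("utf-8-sig")  # accepts UTF-8 with or without a BOM
    except UnicodeDecodeError as exc:
        raise CsvUploadError(
            "The file is not UTF-8. Save it as 'CSV UTF-8' and upload it again."
        ) from exc

    reader = csv.reader(io.StringIO(text, newline=""), strict=True)
    try:
        rows = [[cell.strip() for cell in row] for row in reader]
    except csv.Error as exc:
        raise CsvUploadError(f"Line {reader.line_num} is not valid CSV: {exc}") from exc

    rows = [row for row in rows if any(row)]  # drop completely blank rows
    if not rows:
        raise CsvUploadError("The file is empty.")

    header = [name.lower() for name in rows[0]]
    missing = sorted(set(required_columns) - set(header))
    if missing:
        raise CsvUploadError("Missing column(s): " + ", ".join(missing))

    cleaned = []
    for number, row in enumerate(rows[1:], start=1):
        if len(row) != len(header):
            raise CsvUploadError(
                f"Data row {number} has {len(row)} values; expected {len(header)}."
            )
        cleaned.append(dict(zip(header, row)))
    return cleaned


def run_batch(raw: bytes, required_columns: set, send) -> dict:
    """Run the job once: no retries, and at most one log entry on failure."""
    try:
        rows = sanitize_csv(raw, required_columns)
        for row in rows:
            send(row)  # `send` is a placeholder for the real hand-off
    except Exception as exc:  # any unrecoverable failure ends the job here
        logger.error("Batch job terminated: %s", exc)
        return {"status": "failed", "error": str(exc)}
    return {"status": "finished", "sent": len(rows)}
```

The point of the sketch is the control flow: all validation happens before any row is handed off, every failure path raises or returns exactly once, and nothing in the job re-enters the parser.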

Let's also make sure we're really thinking through this: there's the Python CSV documentation itself, the Python Cookbook and Fluent Python O'Reilly books, articles on how to work with CSV files on Geeks for Geeks and Real Python, and any number of other good authoritative sources of info on how to approach this! 🙂

Next Step

Once we have an ADR written, we'll discuss it as a team and figure out what makes sense to do moving forward, then update this epic with links to new user stories for the actual implementation and testing of the work.

@github-project-automation github-project-automation bot moved this to Issue Backlog (3 Months or Further Out) in Notify.gov product board Dec 18, 2024
@ccostino ccostino moved this from Issue Backlog (3 Months or Further Out) to Epics (Less than 3 Months) in Notify.gov product board Dec 18, 2024