
XMLValidator

Goal: Design a system that reads XML files, then processes the data before feeding the output to a reporting system.

Constraints:

  • receive one record at a time
  • each record contains all accounts for a single user
  • max 100 accounts per user
  • max 100,000 user XML files per day

Requirements / Todos:

  • handle 100 records per second
  • combine all records with custom outputs into one single file
  • send the combined file to the receiving system at 1am daily
  • receive user records in a single batch file at 2am daily
  • do not print account numbers in the output
  • validation
  • HA (high availability)
  • logging
  • allow a flexible output format: XML or JSON (see the formatter sketch after this list)
  • allow additional aggregates
  • allow additional account types with the same element hierarchy
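One way to meet the flexible-output and no-account-numbers requirements is to hide each format behind a small formatter interface. The sketch below is hypothetical (none of these names come from this repository); the `<consumers>` wrapper anticipates the assumption listed in the next section.

```java
import java.util.List;

// Hypothetical strategy interface: a JSON (or other) formatter can be
// added later without touching the record-processing code.
interface OutputFormatter {
    String format(List<UserRecord> records);
}

// The output model deliberately carries no account numbers, so they
// can never leak into the combined file.
record UserRecord(String userId) {}

class XmlOutputFormatter implements OutputFormatter {
    @Override
    public String format(List<UserRecord> records) {
        StringBuilder sb = new StringBuilder("<consumers>\n");
        for (UserRecord r : records) {
            sb.append("  <consumer id=\"").append(r.userId()).append("\"/>\n");
        }
        return sb.append("</consumers>").toString();
    }
}
```

A JsonOutputFormatter would implement the same interface, so switching formats becomes a one-line change at the call site.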

Assumptions + Made-Up Requirements

My assumptions and other missing requirements:

  • one record will be processed at a time
    • no concurrency or data races result, so locks are not required
  • sorting is not needed
    • no special data structure required
  • sorting will be handled by the reporting system
    • store output as objects in an array
  • updates to records will be handled by the reporting system
  • duplicate records will be handled by the reporting system
  • only accept uploads from authorized endpoints
  • only validated records will be included in the output
  • the batch file will be wrapped in a <consumers> tag (see the sample after this list)
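Based on the last assumption, the daily batch file could look like the sample below. Only the `<consumers>` wrapper comes from the assumptions above; the inner element names are made up for illustration.

```xml
<consumers>
  <consumer>
    <name>Jane Doe</name>
    <accounts>
      <account type="checking">
        <balance>1024.50</balance>
      </account>
      <!-- up to 100 accounts per user -->
    </accounts>
  </consumer>
  <!-- up to 100,000 user records per day -->
</consumers>
```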

Part I - Describe and diagram the overall system design

  1. validation/verification
    1. only authorized clients can upload XML files to their assigned AWS S3 bucket
    2. data is validated against a schema stored in the container (see the validation sketch after this list)
    3. invalid records will not be included in the output file
  2. error handling
    1. AWS Batch generates a log entry when a record fails validation
  3. logging
    1. AWS provides logging at each step of the flow, from tracking client login times to data processing
  4. performance considerations
    1. letting clients upload files directly to S3, instead of routing them through MCM's server, gives clients a faster upload and removes the server as a middleman
    2. processing files as a batch at a set time of day, instead of every time a file is received, saves server run time, processing power, memory, and cost
  5. libraries & frameworks
    1. StAX for parsing the XML file / JAXP API for parsing the XML file by constructing a DOM
    2. Docker
    3. React.js / Angular.js, or any frontend that lets a client log in and upload files to their assigned AWS S3 bucket
    4. Spring Boot backend (a REST API that connects to AWS S3 with Cognito)
  6. software & hardware
    1. Docker runs a container that launches the application with securely isolated private resources
  7. cloud services & cloud platform
    1. alternatively, the data can be output to DynamoDB and shared with the reporting system
    2. AWS Cloud allows flexible output formats such as XML or JSON files
    3. AWS Transfer Family lets authorized clients upload files via FTP to AWS S3
  8. storage
    1. AWS S3
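As a concrete reference for step 1.2, here is a minimal validation sketch using the standard javax.xml.validation API. The file names are placeholders, and the logging here stands in for whatever AWS Batch would record:

```java
import java.io.File;
import java.io.IOException;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

public class SchemaValidation {
    // Returns true when the record passes schema validation; invalid
    // records are logged and excluded from the output file.
    static boolean isValid(File xml, File xsd) {
        try {
            Schema schema = SchemaFactory
                    .newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                    .newSchema(xsd);
            Validator validator = schema.newValidator();
            validator.validate(new StreamSource(xml));
            return true;
        } catch (SAXException e) {
            System.err.println("Invalid record: " + e.getMessage());
            return false;
        } catch (IOException e) {
            System.err.println("Could not read file: " + e.getMessage());
            return false;
        }
    }
}
```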

I previously worked as an auditor at a reverse mortgage firm, where I was responsible for processing and auditing loan documents before entering the required information into an Excel file. The Excel file was then sent to the broker at 4pm on business days. Designing this system made me realize how closely that experience maps onto this design process, and helped me understand why the Excel document I generated daily was needed.

As a result, I believe processing the data at a set time of day, instead of every time it is received, will improve system performance: it limits how often the system has to read, process, and write data, and restricts the amount of data flowing in and out throughout the process. This is why I chose the AWS Batch service, which assigns processing power and memory to a job based on the size of the data, since we do not know how many files we will receive per day, only the maximum.

Moreover, running the system on the cloud provides high availability, scalability, flexibility, stability, and accessibility for both the company and clients. The whole process is automated and can be done serverless with AWS cloud services, from client authentication to data processing. It is flexible and scalable because processing power can be assigned and adjusted according to the data volume. Different programs can be created to run on specific files (e.g. a single record vs. a batch record). Processing can also be switched to run whenever data is received, by changing a setting in AWS Batch, and processed files can be deleted automatically to save storage.

Part II

  1. 📝 Instructions
    1. Run the XMLstaxValidator.java file
  2. describe the code used to parse and process the data
    1. Specify the XML file and the XSD file (the schema used for validation) in XMLstaxValidator.java
    2. Run the XMLstaxValidator.java program, which compares the XML file against the schema
    3. If there are no errors, the program passes the file to the XMLReader.java program
    4. XMLReader.java uses StAX to parse the data, going through the XML file from top to bottom (see the sketch after this list)
    5. it then prints the data in the assigned format to the output.xml file
  3. UML to describe your class hierarchy and design patterns
    1. please refer to the picture below for the UML and flowchart
  4. advantages
    1. high scalability
    2. uses a schema to validate incoming data
    3. does not include invalid records in the output file
    4. the schema is flexible and allows additional aggregates in the future via the xs:all tag
    5. the same XML reader can also be used to run the batch file, as both share the same pattern
  5. disadvantages
    1. a different schema (for validation) is required to run the batch file when it has a different root tag (we can set the job definition in AWS Batch to run a different program / schema when the batch file comes in at 2am)
  6. reason of choice
    1. I tried both StAX and JAXP for parsing and processing the data, and found StAX more suitable in this case: it is faster once the data has already been validated against the schema. Moreover, the same program can be used to run both an individual file and the batch file, since it loops through the file from top to bottom before printing the necessary info to the output file, without needing to search for a particular keyword.
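For reference, here is a minimal sketch of the top-to-bottom StAX pass described in step 2.4. The element names (accountNumber, balance) and the skip rule are assumptions for illustration, not copied from XMLReader.java:

```java
import java.io.FileInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class StaxSketch {
    public static void main(String[] args) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        try (FileInputStream in = new FileInputStream("record.xml")) {
            XMLStreamReader reader = factory.createXMLStreamReader(in);
            // Single forward pass: StAX pulls events top to bottom,
            // so no DOM tree is built and memory use stays flat.
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    String name = reader.getLocalName();
                    if ("accountNumber".equals(name)) {
                        reader.getElementText(); // consume and discard: never printed
                    } else if ("balance".equals(name)) {
                        System.out.println("balance = " + reader.getElementText());
                    }
                }
            }
            reader.close();
        }
    }
}
```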
