Initial metadata extraction implementation #159

Draft
wants to merge 10 commits into
base: main
from

Conversation

jack11wagner
Contributor

No description provided.

…mment are found

* This change uses the save_json method in the savers
* Note: the DiskSaver and S3 saver were previously passed the full data dict
* I changed the savers and the client so that the parameter for saving JSON is data["results"] instead of the full data dictionary
* This is a better way to implement the save_json methods, since what is passed in is the JSON we want to save, rather than the savers needing to find the key themselves.
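
The interface change described above could look roughly like the following. This is a minimal sketch, not the code in this PR: the `DiskSaver` body and the `save_job_results` helper are assumptions made for illustration, and only the `save_json(path, data)` call shape and the `data["results"]` key come from the commit message.

```python
import json
import os


class DiskSaver:
    """Hypothetical minimal saver; the real DiskSaver in this repo may differ."""

    def save_json(self, path, data):
        # `data` is already the JSON payload to write, so the saver
        # does not need to look up any key itself.
        directory = os.path.dirname(path)
        if directory:
            os.makedirs(directory, exist_ok=True)
        with open(path, "w", encoding="utf-8") as f:
            json.dump(data, f)


def save_job_results(saver, path, data):
    # Hypothetical caller (standing in for the client): it selects the
    # part of the dict to persist and hands only that to the saver.
    saver.save_json(path, data["results"])
```

With this shape, an S3 saver can expose the same `save_json(path, data)` signature and the client code stays identical for both.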
meta["extraction_status"][file_name] = "Not Attempted"
meta_save_path = f"{meta_save_dir}/extraction-metadata.json"
self.saver.save_json(meta_save_path, meta)
return meta_save_path, meta
Contributor


This method does not have to return anything. Otherwise good.
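
For reference, a version of the method without the return value might look like the sketch below. The class names, the method name `save_extraction_metadata`, its parameters, and the in-memory saver are all hypothetical; only the four body lines taken from the diff above reflect the actual change.

```python
class InMemorySaver:
    """Hypothetical stand-in for the project's savers, for illustration only."""

    def __init__(self):
        self.saved = {}

    def save_json(self, path, data):
        self.saved[path] = data


class Client:
    """Hypothetical wrapper class; only the metadata lines mirror the diff."""

    def __init__(self, saver):
        self.saver = saver

    def save_extraction_metadata(self, meta_save_dir, file_names):
        meta = {"extraction_status": {}}
        for file_name in file_names:
            meta["extraction_status"][file_name] = "Not Attempted"
        meta_save_path = f"{meta_save_dir}/extraction-metadata.json"
        self.saver.save_json(meta_save_path, meta)
        # No return statement: the method's job is the side effect of saving.
```

A caller that needs the path can rebuild it from `meta_save_dir`, since the file name is fixed.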

@jack11wagner jack11wagner marked this pull request as draft April 20, 2023 18:30
@jack11wagner
Contributor Author

[Image: example extraction-metadata.json]

@jack11wagner
Contributor Author

Not writing to S3 yet.

@jack11wagner
Contributor Author

After discussion with Dr. Coleman, and given the time constraints at the end of the semester, we are putting the extraction metadata task off for now. At the moment, the extraction metadata is a JSON file listing all the attachments in a docket, with every status initialized to "Not Attempted". We never got to having the extractor update this JSON because of the challenges with loading the file and the problems with this change in S3.
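
For whoever picks this up later, the missing update step could be sketched as a load, modify, and save on a local file. This is only an assumption about how it might work: `update_extraction_status` is a hypothetical helper, the set of valid status values beyond "Not Attempted" is undecided, and the sketch covers local disk only, not S3.

```python
import json


def update_extraction_status(meta_path, file_name, status):
    """Hypothetical helper: set one attachment's status in extraction-metadata.json."""
    with open(meta_path, encoding="utf-8") as f:
        meta = json.load(f)
    # Assumes the layout used so far: {"extraction_status": {file_name: status}}
    meta.setdefault("extraction_status", {})[file_name] = status
    with open(meta_path, "w", encoding="utf-8") as f:
        json.dump(meta, f, indent=2)
```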

@jack11wagner
Contributor Author

It is still unclear exactly what the extraction metadata would need to contain, so this PR will need to be revisited at a later date.


2 participants