The data from regulations.gov is organized into dockets. Each docket contains all documents and comments for one proposed rule. For each docket, document, and comment there is one JSON file containing a variety of information about the element. The documents and comments can contain attachments (typically .doc, .docx, or .pdf files) that are linked as a field in the JSON file.
All elements are identified by an ID. A docket ID is made up of:
- agency abbreviation
- year of docket
- docket number
Document IDs and comment IDs are made up of the docket ID followed by a unique number.
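As a minimal sketch of the ID scheme described above, the following Python splits an ID into its parts. The names `DocketId`, `parse_docket_id`, and `split_item_id` are hypothetical helpers introduced for illustration and are not part of the data set or any tooling. The sketch assumes the parts are joined with hyphens, as in `USTR-2015-0010`, and splits from the right so that the year and docket number are always the last two fields.

```python
from typing import NamedTuple


class DocketId(NamedTuple):
    """Parts of a docket ID (hypothetical helper type)."""
    agency: str
    year: int
    number: str


def parse_docket_id(docket_id: str) -> DocketId:
    """Split a docket ID into agency abbreviation, year, and docket number."""
    # Split from the right: the last two fields are the year and the number,
    # everything before them is treated as the agency abbreviation.
    agency, year, number = docket_id.rsplit("-", 2)
    return DocketId(agency, int(year), number)


def split_item_id(item_id: str) -> tuple[str, str]:
    """Split a document or comment ID into (docket ID, unique number)."""
    docket_id, _, number = item_id.rpartition("-")
    return docket_id, number
```

For example, `split_item_id("USTR-2015-0010-0002")` yields the docket ID `USTR-2015-0010` and the unique number `0002`. The numbers are kept as strings so that leading zeros survive.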
The data is organized by agency and docket. Within each docket, the binary data (attachments) are stored separately from the text data so that users can easily download only the text data.
The structure separates the results from multiple text extraction tools.
data
└── <agency>
    └── <docket id>
        ├── binary-<docket id>
        │   ├── comments_attachments
        │   │   ├── <comment id>_attachment_<counter>.<extension>
        │   │   └── ...
        │   └── documents_attachments
        │       ├── <document id>_attachment_<counter>.<extension>
        │       └── ...
        └── text-<docket id>
            ├── comments
            │   ├── <comment id>.json
            │   └── ...
            ├── comments_extracted_text
            │   ├── <tool name>
            │   │   ├── <comment id>_attachment_<counter>_extracted.txt
            │   │   └── ...
            │   └── ... <other tools>
            ├── docket
            │   ├── <docket id>.json
            │   └── ...
            ├── documents
            │   ├── <document id>.json
            │   ├── <document id>_content.htm
            │   └── ...
            └── documents_extracted_text
                ├── <tool name>
                │   ├── <document id>_content_extracted.txt
                │   └── ...
                └── ... <other tools>
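The layout above can be turned into paths mechanically. The following is a sketch only: the function names and the `data_root` parameter are hypothetical, and the file names use the spelling `attachment` as it appears in example file names such as `USTR-2015-0010-0002_attachment_1.pdf`. It covers the comment-related files; document files would follow the same pattern.

```python
from pathlib import PurePosixPath


def docket_root(data_root: str, docket_id: str) -> PurePosixPath:
    """Return <data_root>/<agency>/<docket id>."""
    # The agency is everything before the trailing year and docket number.
    agency = docket_id.rsplit("-", 2)[0]
    return PurePosixPath(data_root) / agency / docket_id


def comment_json_path(data_root: str, comment_id: str) -> PurePosixPath:
    """Path of the JSON file for a comment."""
    docket_id = comment_id.rpartition("-")[0]
    return (docket_root(data_root, docket_id)
            / f"text-{docket_id}" / "comments" / f"{comment_id}.json")


def comment_attachment_path(data_root: str, comment_id: str,
                            counter: int, extension: str) -> PurePosixPath:
    """Path of a binary attachment belonging to a comment."""
    docket_id = comment_id.rpartition("-")[0]
    name = f"{comment_id}_attachment_{counter}.{extension}"
    return (docket_root(data_root, docket_id)
            / f"binary-{docket_id}" / "comments_attachments" / name)


def comment_extracted_text_path(data_root: str, comment_id: str,
                                counter: int, tool: str) -> PurePosixPath:
    """Path of the text a given tool extracted from a comment attachment."""
    docket_id = comment_id.rpartition("-")[0]
    name = f"{comment_id}_attachment_{counter}_extracted.txt"
    return (docket_root(data_root, docket_id)
            / f"text-{docket_id}" / "comments_extracted_text" / tool / name)
```

Because the binary and text trees share the same docket root, a caller that only needs text never has to touch a path under `binary-<docket id>`.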
As an example, the agency USTR has a docket with the ID USTR-2015-0010 that holds 1 docket file, 4 documents, and 4 comments. Each of the comments has one attachment, and each of the documents has one or more attachments. The tool pikepdf was used to extract text from these attachments.
This data would be stored in the structure as follows:
USTR
└── USTR-2015-0010
    ├── binary-USTR-2015-0010
    │   ├── comments_attachments
    │   │   ├── USTR-2015-0010-0002_attachment_1.pdf
    │   │   ├── USTR-2015-0010-0003_attachment_1.pdf
    │   │   ├── USTR-2015-0010-0004_attachment_1.pdf
    │   │   └── USTR-2015-0010-0005_attachment_1.pdf
    │   └── documents_attachments
    │       ├── USTR-2015-0010-0001_content.pdf
    │       ├── USTR-2015-0010-0015_content.pdf
    │       ├── USTR-2015-0010-0016_content.doc
    │       ├── USTR-2015-0010-0016_content.pdf
    │       ├── USTR-2015-0010-0017_content.doc
    │       └── USTR-2015-0010-0017_content.pdf
    └── text-USTR-2015-0010
        ├── comments
        │   ├── USTR-2015-0010-0002.json
        │   ├── USTR-2015-0010-0003.json
        │   ├── USTR-2015-0010-0004.json
        │   └── USTR-2015-0010-0005.json
        ├── comments_extracted_text
        │   └── pikepdf
        │       ├── USTR-2015-0010-0002_attachment_1_extracted.txt
        │       ├── USTR-2015-0010-0003_attachment_1_extracted.txt
        │       ├── USTR-2015-0010-0004_attachment_1_extracted.txt
        │       └── USTR-2015-0010-0005_attachment_1_extracted.txt
        ├── docket
        │   └── USTR-2015-0010.json
        ├── documents
        │   ├── USTR-2015-0010-0001.json
        │   ├── USTR-2015-0010-0001_content.htm
        │   ├── USTR-2015-0010-0015.json
        │   ├── USTR-2015-0010-0016.json
        │   └── USTR-2015-0010-0017.json
        └── documents_extracted_text
            └── pikepdf
                ├── USTR-2015-0010-0015_content_extracted.txt
                ├── USTR-2015-0010-0016_content_extracted.txt
                └── USTR-2015-0010-0017_content_extracted.txt
- At the root level, there is a folder, USTR, for the agency.
  - In the agency folder there is a folder for the docket ID, USTR-2015-0010.
    - In the docket folder there are two subfolders that separate the binary data from the text data: binary-USTR-2015-0010 and text-USTR-2015-0010.
      - The binary-USTR-2015-0010 folder contains two subdirectories, comments_attachments and documents_attachments, to hold the attachments for comments and documents, respectively.
        - The comments_attachments folder contains each attachment file, named using the comment ID followed by the attachment number, such as USTR-2015-0010-0002_attachment_1.pdf.
        - The documents_attachments folder contains each attachment file, named using the document ID followed by the word content, such as USTR-2015-0010-0001_content.pdf and USTR-2015-0010-0016_content.doc.
      - The text-USTR-2015-0010 folder contains five subdirectories: docket, documents, comments, comments_extracted_text, and documents_extracted_text.
        - The comments folder contains one JSON file for each comment, named with the comment ID, such as USTR-2015-0010-0002.json.
        - The comments_extracted_text folder contains a subdirectory for each extraction tool used. In this example, only the tool pikepdf was used.
          - The pikepdf directory contains one text file for each attachment with extracted text. These files are named with the comment ID, the attachment number, and the word extracted, such as USTR-2015-0010-0002_attachment_1_extracted.txt.
        - The docket folder contains the JSON file for the docket, named with the docket ID, such as USTR-2015-0010.json.
        - The documents folder contains one JSON file for each document, along with the HTM file containing the document text. Both files are named using the document ID, such as USTR-2015-0010-0001.json and USTR-2015-0010-0001_content.htm.
        - The documents_extracted_text folder contains a subdirectory for each extraction tool used. In this example, only the tool pikepdf was used.
          - The pikepdf directory contains one text file for each attachment with extracted text. These files are named with the document ID, the word content, and the word extracted, such as USTR-2015-0010-0015_content_extracted.txt.
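Since each extraction tool writes into its own subdirectory, the extracted text can be grouped by tool with a short directory walk. The sketch below assumes a local copy of a docket folder laid out as in the USTR example; the function name `extracted_text_by_tool` is hypothetical and chosen only for illustration.

```python
from pathlib import Path


def extracted_text_by_tool(docket_dir: Path) -> dict[str, list[str]]:
    """Map each extraction tool name to the extracted text files it produced.

    docket_dir is a local docket folder such as .../USTR/USTR-2015-0010.
    """
    docket_id = docket_dir.name
    text_dir = docket_dir / f"text-{docket_id}"
    result: dict[str, list[str]] = {}
    for kind in ("comments_extracted_text", "documents_extracted_text"):
        kind_dir = text_dir / kind
        if not kind_dir.is_dir():
            continue  # nothing was extracted for this kind of element
        # Each subdirectory is named after the tool that produced its files.
        for tool_dir in sorted(p for p in kind_dir.iterdir() if p.is_dir()):
            files = sorted(f.name for f in tool_dir.glob("*_extracted.txt"))
            result.setdefault(tool_dir.name, []).extend(files)
    return result
```

The function only reads under `text-<docket id>`, so it works on a copy of the data that omits the binary attachments.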