
Loading Data

Robert L. Logan IV edited this page Jan 15, 2021 · 3 revisions

AutoPrompt is mainly used in a masked language modeling setting; however, most datasets contain structured data, e.g., data that is composed of multiple fields, has integer or boolean labels, etc. To prepare this data for use with a masked language model, AutoPrompt employs the following pipeline:

  1. Data is loaded from a file using a preprocessor that converts each instance into a flat dictionary whose keys are fields and whose values are text. For many datasets that already have a flat structure (e.g., JSONL and delimited text files) this is done automatically. However, in cases where data is nested in a non-trivial manner, or comes in an uncommon format, you may need to write a custom preprocessing function and add it to autoprompt/preprocessors.py (see preprocess_wsc for an example).
  2. The dictionaries are then converted into MLM inputs using a template string. The template string can consist of static text, instance fields in curly brackets, and special [T] and [P] tokens denoting where the triggers and mask tokens should be placed. An example template for NLI might look like:
    {premise} [T] [T] {hypothesis}? [P].
    The template and instance fields are converted into a transformers-friendly input by a MultiTokenTemplatizer, which handles the bookkeeping for adding the correct number of masks for multi-token labels, as well as keeping track of which parts of the input correspond to triggers (trigger_mask) and predict tokens (predict_mask). If your task involves non-textual labels, you can define a mapping from labels to text by specifying a label map.
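The two steps above can be illustrated with a minimal, self-contained sketch. It is not AutoPrompt's implementation: `load_jsonl` and `templatize` are hypothetical helpers, the example instance and label map are made up, and whitespace splitting stands in for the real tokenizer, so the number of masks here is the number of whitespace-separated words in the label rather than the number of subword tokens that MultiTokenTemplatizer would count.

```python
import io
import json
import re

# Matches the special template tokens; the capture group keeps them in the split output.
SPECIAL = re.compile(r"(\[T\]|\[P\])")


def load_jsonl(handle):
    """Hypothetical preprocessor: yield one flat dict of text values per JSONL line."""
    for line in handle:
        if line.strip():
            record = json.loads(line)
            # Cast every value to text so integer or boolean labels become strings.
            yield {key: str(value) for key, value in record.items()}


def templatize(template, instance, label_field="label", label_map=None):
    """Hypothetical stand-in for the templatizer, using whitespace tokens."""
    label = instance[label_field]
    if label_map is not None:
        label = label_map[label]  # map a non-textual label to text
    # Assumption: one mask per whitespace-separated word of the label.
    num_masks = len(label.split())

    tokens, trigger_mask, predict_mask = [], [], []
    for piece in SPECIAL.split(template):
        if piece == "[T]":
            new, is_trigger, is_predict = ["[T]"], True, False
        elif piece == "[P]":
            new, is_trigger, is_predict = ["[MASK]"] * num_masks, False, True
        else:
            # Static text and {field} placeholders are filled from the instance.
            new, is_trigger, is_predict = piece.format_map(instance).split(), False, False
        tokens.extend(new)
        trigger_mask.extend([is_trigger] * len(new))
        predict_mask.extend([is_predict] * len(new))
    return tokens, trigger_mask, predict_mask, label


raw = io.StringIO('{"premise": "A dog runs", "hypothesis": "An animal moves", "label": 0}\n')
instance = next(load_jsonl(raw))
tokens, trigger_mask, predict_mask, label = templatize(
    "{premise} [T] [T] {hypothesis}? [P].",
    instance,
    label_map={"0": "yes", "1": "no"},  # illustrative mapping, not taken from the project
)
print(tokens)
print(trigger_mask)
print(predict_mask)
```

The sketch splits the template on the special tokens before filling in fields, so instance text that happens to contain "[T]" or "[P]" is never mistaken for a template token.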

Things to know:

  • A template should only contain one [P].
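
A quick way to catch a malformed template before training is to count its [P] tokens. The helper below is a hypothetical sketch, not part of AutoPrompt, and it assumes that a template with no [P] at all should be rejected as well, since there would then be no position to predict.

```python
def check_template(template):
    """Hypothetical check: raise if the template does not contain exactly one [P]."""
    count = template.count("[P]")
    if count != 1:
        raise ValueError(f"expected exactly one [P] in template, found {count}")
    return template


check_template("{premise} [T] [T] {hypothesis}? [P].")  # passes silently
```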