
Question: too many indices for tensor of dimension 1 #40

Open
Lavi11C opened this issue Sep 11, 2023 · 16 comments

Lavi11C commented Sep 11, 2023

I can run inference-example.py, but when I try to combine it with my own code I get the error "too many indices for tensor of dimension 1". I guess in inference-example the dataset is just one tensor, and my own dataset (built from PDFs) should be one tensor as well, but I must be missing something. Would you help me fix this problem? Thank you very much.
[screenshot: Screenshot 2023-09-11 at 4 55 19 PM]

urialon (Collaborator) commented Sep 13, 2023

Hi @Lavi11C,
Thank you for your interest in our work!

@abertsch72 - do you have an idea regarding inference_example.py?

In the meantime, I suggest following these instructions: https://github.com/abertsch72/unlimiformer#reproducing-the-experiments-from-the-paper---command-lines
which are fully reproducible.

Let us know if you have any questions!

Best,
Uri

Lavi11C (Author) commented Sep 17, 2023

Thanks for your reply.
Do you mean I can run run.py on my own dataset with those arguments?
I think "src/configs/data/gov_report.json" is the part that specifies the dataset. My datasets are all txt files; can I point the config at them directly, or do I need to preprocess them first?
One final question: my project is about detecting whether a file is benign or malicious, so I ultimately need an accuracy score.
After running run.py, I believe I will get BERTScore. Is that the accuracy, or do I have to write the accuracy code myself?
Lots of questions; thank you for helping me sort them out.

urialon (Collaborator) commented Sep 17, 2023

Yes, you can definitely run with your own dataset, by duplicating the json file and editing it to point to another dataset.

Note that your data needs to be in the same Huggingface format, for example: https://huggingface.co/datasets/tau/sled/viewer/gov_report/train
You will need to create a Huggingface dataset with input and output fields.
I'm attaching an example that shows how to create such datasets.

You can see that some of the other datasets in our repo (other than gov_report.json) have accuracy as their metric. If you define accuracy in your custom json, it will be measured instead of ROUGE and BERTScore.

Best,
Uri

urialon (Collaborator) commented Sep 17, 2023


from argparse import ArgumentParser
import datasets
from tqdm import tqdm
from datasets import load_dataset

def save_dataset(examples, output_dir, split_name, hub_name=None):
    # Build a Dataset from a list of example dicts
    subset_dataset = datasets.Dataset.from_list(examples, split=split_name)
    if output_dir is not None:
        subset_dataset.save_to_disk(f'{output_dir}/{split_name}')
        print(f'Saved {len(subset_dataset)} {split_name} examples to disk')
    if hub_name is not None:
        subset_dataset.push_to_hub(hub_name)
        print(f'Pushed {len(subset_dataset)} {split_name} examples to {hub_name}')
    return subset_dataset

if __name__ == '__main__':
    parser = ArgumentParser()
    parser.add_argument('--input_dataset', required=True, help='dir')
    parser.add_argument('--output_dir', required=True, help='dir')
    parser.add_argument('--hub_name', required=False)
    args = parser.parse_args()

    dataset = load_dataset('tau/sled', args.input_dataset)
    
    for split in dataset:
        subset = dataset[split]
        new_subset = []
        for example in tqdm(subset, total=len(subset)):
            new_example = {
                'id': example['id'],
                'pid': example['pid'],
                'input': f"Q: {example['input_prefix']}\nText: {example['input']}",
                'output': example['output'] if example['output'] is not None else '',
            }
            
            new_subset.append(new_example)
        save_dataset(new_subset, output_dir=args.output_dir, split_name=split, hub_name=args.hub_name)

    

Lavi11C (Author) commented Sep 18, 2023

You mean that if I want to use Unlimiformer, I have to put my dataset on Hugging Face first? I can't load a dataset into the code from a local path, right?

urialon (Collaborator) commented Sep 18, 2023

You don't necessarily need to upload it to the Huggingface hub, but you do need to convert it to the Huggingface format; then you can just save it to disk and load it from there.

Lavi11C (Author) commented Sep 19, 2023

By the way, my own datasets are all txt files.
Could I just load the data like this: dataset = load_dataset('text', data_files={'train': ['my_text_1.txt', 'my_text_2.txt'], 'test': 'my_test_file.txt'})?

urialon (Collaborator) commented Sep 19, 2023

I don't know; you can try.
But how do you specify the output this way?

Lavi11C (Author) commented Sep 19, 2023

I found this approach on the Internet, and the resulting type is "Dataset", so I think it should be fine, shouldn't it?

Lavi11C (Author) commented Sep 19, 2023


from argparse import ArgumentParser
import datasets
from tqdm import tqdm
from datasets import load_dataset

def save_dataset(examples, output_dir, split_name, hub_name=None):
    # Build a Dataset from a list of example dicts
    subset_dataset = datasets.Dataset.from_list(examples, split=split_name)
    if output_dir is not None:
        subset_dataset.save_to_disk(f'{output_dir}/{split_name}')
        print(f'Saved {len(subset_dataset)} {split_name} examples to disk')
    if hub_name is not None:
        subset_dataset.push_to_hub(hub_name)
        print(f'Pushed {len(subset_dataset)} {split_name} examples to {hub_name}')
    return subset_dataset

if __name__ == '__main__':
    parser = ArgumentParser()
    parser.add_argument('--input_dataset', required=True, help='dir')
    parser.add_argument('--output_dir', required=True, help='dir')
    parser.add_argument('--hub_name', required=False)
    args = parser.parse_args()

    dataset = load_dataset('tau/sled', args.input_dataset)
    
    for split in dataset:
        subset = dataset[split]
        new_subset = []
        for example in tqdm(subset, total=len(subset)):
            new_example = {
                'id': example['id'],
                'pid': example['pid'],
                'input': f"Q: {example['input_prefix']}\nText: {example['input']}",
                'output': example['output'] if example['output'] is not None else '',
            }
            
            new_subset.append(new_example)
        save_dataset(new_subset, output_dir=args.output_dir, split_name=split, hub_name=args.hub_name)

    

Because I don't really understand this code: there are some parameters I have no idea about.
For example, I only have 'text', so should I write 'text': example['text'] and drop the other fields, such as 'id' and 'pid'?
Should I rewrite save_dataset?

urialon (Collaborator) commented Sep 21, 2023 via email

Lavi11C (Author) commented Sep 25, 2023

So I can't just upload my dataset to Hugging Face directly?
Sorry, I don't understand.
Can't I just put each .txt file's content into 'input'?
I ask because I don't see why I need to put anything into 'output' when I use inference-example.py.

abertsch72 (Owner) commented

Hi @Lavi11C! You have a field called 'text' in your dataset, right? You can use the code Uri posted to save a version of the dataset that renames that field to 'input'. Then you can run the Unlimiformer evaluation directly using run.py. If you don't rename the field, run.py has no way of determining which field to use as the inputs.

Alternatively, you can use code like inference-example.py and implement your own loop over the dataset and evaluation; then you can specify the input and output fields you expect directly.

Lavi11C (Author) commented Sep 25, 2023

Sorry, my fault. I don't have a field called 'text' in my dataset; I meant that my dataset's file type is text. I will add an 'input' field when I upload my dataset to Hugging Face. I got the code below from ChatGPT. Do you think it works and fits your format?
If it does work, can I then run inference-example.py?

import os
from datasets import Dataset

folder_path = 'path_to_folder_containing_txt_files'
dataset = []

# Read every txt file in the folder into one example each
for filename in os.listdir(folder_path):
    if filename.endswith('.txt'):
        with open(os.path.join(folder_path, filename), 'r') as file:
            dataset.append({'input': file.read()})

# from_list expects a list of dicts; from_dict would need a dict of lists
my_dataset = Dataset.from_list(dataset)

my_dataset.push_to_hub('my_dataset_name', use_auth_token='YOUR_AUTH_TOKEN')

Lavi11C (Author) commented Sep 26, 2023

[screenshot: the dataset on Hugging Face]
This is my dataset on Hugging Face. I ultimately need an accuracy score, since this is a binary classification task. Do you think I can do that with Unlimiformer? Because I added a 'label' field, I'm not sure I'm still following your format.

Lavi11C (Author) commented Sep 27, 2023

[screenshot: the same error message]
I still get this error, and I can't understand why.

encoded_train_dataset = tokenizer.batch_encode_plus(
    dataset_train["train"]["input"],
    truncation=True,
    padding='max_length',
    max_length=99999,
    return_tensors="pt"
)

I set this part up for the tokenizer. Does the problem happen here? Or is it because I use a 'label' field, which your code doesn't allow?

Or maybe your code can't run over a whole dataset and only works on a single file?

Sorry for the many questions.
