
Question: too many indices for tensor of dimension 1 #40

Open
Lavi11C opened this issue Sep 11, 2023 · 16 comments

Lavi11C commented Sep 11, 2023

I can run inference-example.py, but when I try to combine it with my own code I get the error "too many indices for tensor of dimension 1". I guess in inference-example the dataset is just one tensor, and my own dataset (built from PDFs) should be one tensor as well, but I must be missing something. Would you help me fix this problem? Thank you very much.
[screenshot: Screenshot 2023-09-11 at 4 55 19 PM]

urialon (Collaborator) commented Sep 13, 2023

Hi @Lavi11C,
Thank you for your interest in our work!

@abertsch72 - do you have an idea regarding inference_example.py?

In the meantime, I suggest following these instructions: https://github.com/abertsch72/unlimiformer#reproducing-the-experiments-from-the-paper---command-lines
which are fully reproducible.

Let us know if you have any questions!

Best,
Uri

Lavi11C (Author) commented Sep 17, 2023

Thanks for your reply.
Do you mean I can run run.py on my own dataset with those arguments?
I think "src/configs/data/gov_report.json" is the part that specifies the dataset. My datasets are all txt files; can I point the config at them directly, or do I need to preprocess them first?
One final question: my project is about detecting whether a file is benign or malicious, so I ultimately need an accuracy score.
After running run.py, I believe I will get BERTScore. Is that the accuracy, or do I have to write the accuracy code myself?
Lots of questions; thank you for helping me sort them out.

urialon (Collaborator) commented Sep 17, 2023

Yes, you can definitely run with your own dataset, by duplicating the json file and editing it to point to another dataset.

Note that your data needs to be in the same Huggingface format, for example: https://huggingface.co/datasets/tau/sled/viewer/gov_report/train
You will need to create a Huggingface dataset with input and output fields.
I'm attaching an example that shows how to create such datasets.

You can see that some of the other datasets in our repo (other than gov_report.json) have accuracy as their metric. If you define accuracy in your custom json, it will be measured instead of ROUGE and BERTScore.

Best,
Uri

urialon (Collaborator) commented Sep 17, 2023


from argparse import ArgumentParser
import datasets
from tqdm import tqdm
from datasets import load_dataset

def save_dataset(examples, output_dir, split_name, hub_name=None):
    # Build a Dataset from a list of example dicts
    subset_dataset = datasets.Dataset.from_list(examples, split=split_name)
    if output_dir is not None:
        subset_dataset.save_to_disk(f'{output_dir}/{split_name}')
        print(f'Saved {len(subset_dataset)} {split_name} examples to disk')
    if hub_name is not None:
        subset_dataset.push_to_hub(hub_name)
        print(f'Pushed {len(subset_dataset)} {split_name} examples to {hub_name}')
    return subset_dataset

if __name__ == '__main__':
    parser = ArgumentParser()
    parser.add_argument('--input_dataset', required=True, help='dir')
    parser.add_argument('--output_dir', required=True, help='dir')
    parser.add_argument('--hub_name', required=False)
    args = parser.parse_args()

    dataset = load_dataset('tau/sled', args.input_dataset)
    
    for split in dataset:
        subset = dataset[split]
        new_subset = []
        for example in tqdm(subset, total=len(subset)):
            new_example = {
                'id': example['id'],
                'pid': example['pid'],
                'input': f"Q: {example['input_prefix']}\nText: {example['input']}",
                'output': example['output'] if example['output'] is not None else '',
            }
            
            new_subset.append(new_example)
        save_dataset(new_subset, output_dir=args.output_dir, split_name=split, hub_name=args.hub_name)

    

Lavi11C (Author) commented Sep 18, 2023

You mean that if I want to use Unlimiformer, I have to put my dataset on Hugging Face first? I can't load a dataset into the code from a local path, right?

urialon (Collaborator) commented Sep 18, 2023

You don't necessarily need to upload it to the Huggingface hub, but you do need to convert it to the Huggingface format; then you can just save it to disk and load it from there.

Lavi11C (Author) commented Sep 19, 2023

By the way, my own datasets are all txt files.
Could I just load the data like this: dataset = load_dataset('text', data_files={'train': ['my_text_1.txt', 'my_text_2.txt'], 'test': 'my_test_file.txt'})?

urialon (Collaborator) commented Sep 19, 2023

I don't know; you can try.
But how do you specify the output this way?

Lavi11C (Author) commented Sep 19, 2023

I found this approach on the Internet, and the resulting type is "Dataset", so I think it should be fine, shouldn't it?

Lavi11C (Author) commented Sep 19, 2023


from argparse import ArgumentParser
import datasets
from tqdm import tqdm
from datasets import load_dataset

def save_dataset(examples, output_dir, split_name, hub_name=None):
    # Build a Dataset from a list of example dicts
    subset_dataset = datasets.Dataset.from_list(examples, split=split_name)
    if output_dir is not None:
        subset_dataset.save_to_disk(f'{output_dir}/{split_name}')
        print(f'Saved {len(subset_dataset)} {split_name} examples to disk')
    if hub_name is not None:
        subset_dataset.push_to_hub(hub_name)
        print(f'Pushed {len(subset_dataset)} {split_name} examples to {hub_name}')
    return subset_dataset

if __name__ == '__main__':
    parser = ArgumentParser()
    parser.add_argument('--input_dataset', required=True, help='dir')
    parser.add_argument('--output_dir', required=True, help='dir')
    parser.add_argument('--hub_name', required=False)
    args = parser.parse_args()

    dataset = load_dataset('tau/sled', args.input_dataset)
    
    for split in dataset:
        subset = dataset[split]
        new_subset = []
        for example in tqdm(subset, total=len(subset)):
            new_example = {
                'id': example['id'],
                'pid': example['pid'],
                'input': f"Q: {example['input_prefix']}\nText: {example['input']}",
                'output': example['output'] if example['output'] is not None else '',
            }
            
            new_subset.append(new_example)
        save_dataset(new_subset, output_dir=args.output_dir, split_name=split, hub_name=args.hub_name)

    

Because I don't really understand this code: there are some parameters I have no idea about.
For example, I only have 'text', so should I write 'text': example['text'] and drop the other fields, such as 'id' and 'pid'?
Should I rewrite save_dataset?

urialon (Collaborator) commented Sep 21, 2023 via email

Lavi11C (Author) commented Sep 25, 2023

So I can't just upload my dataset to Hugging Face directly?
Sorry, I don't understand.
Can't I just put each .txt file's content into 'input'?
I ask because I don't see why I need to put anything into 'output' when I use inference-example.py.

abertsch72 (Owner) commented

Hi @Lavi11C! You have a field called 'text' in your dataset, right? You can use the code Uri posted to save a version of the dataset that renames that field to 'input'. Then you can run the Unlimiformer evaluation directly using run.py. If you don't rename the field, run.py has no way of determining which field to use as the inputs.

Alternatively, you can use code like inference-example.py and implement your own loop over the dataset and evaluation; then you can specify the input and output fields you expect directly.

Lavi11C (Author) commented Sep 25, 2023

Sorry, my fault. I don't have a field called 'text' in my dataset; I meant that my dataset's file type is text. I will add an 'input' field when I upload my dataset to Hugging Face. I got the code below from ChatGPT. Do you think it works and fits your format?
If it does work, can I then run inference-example.py?

import os
from datasets import Dataset

folder_path = 'path_to_folder_containing_txt_files'
dataset = []

# Read every txt file in the folder into one example each
for filename in os.listdir(folder_path):
    if filename.endswith('.txt'):
        with open(os.path.join(folder_path, filename), 'r') as file:
            dataset.append({'input': file.read()})

# from_list expects a list of dicts; from_dict would need a dict of lists
my_dataset = Dataset.from_list(dataset)

my_dataset.push_to_hub('my_dataset_name', use_auth_token='YOUR_AUTH_TOKEN')

Lavi11C (Author) commented Sep 26, 2023

[screenshot: the dataset on Hugging Face]
This is my dataset on Hugging Face. I ultimately need an accuracy score, since this is a binary classification task. Do you think I can do that with Unlimiformer? Because I added a 'label' field, I'm not sure I'm still following your format.

Lavi11C (Author) commented Sep 27, 2023

[screenshot: the same error message]
I still get this error, and I can't understand why.

encoded_train_dataset = tokenizer.batch_encode_plus(
    dataset_train["train"]["input"],
    truncation=True,
    padding='max_length',
    max_length=99999,
    return_tensors="pt"
)

I set this part up for the tokenizer. Does the problem happen here? Or is it because I use a 'label' field, which your code doesn't allow?

Or maybe your code can't run over a whole dataset and only works on a single file?

Sorry for the many questions.
