Question: too many indices for tensor of dimension 1 #40
Comments
Hi @Lavi11C! @abertsch72, do you have an idea regarding this? In the meantime, I suggest following these instructions: https://github.com/abertsch72/unlimiformer#reproducing-the-experiments-from-the-paper---command-lines Let us know if you have any questions! Best,
Thanks for your reply.
Yes, you can definitely run with your own datasets by duplicating the json file and editing it to point to another dataset. Note that your data needs to be in the same Hugging Face format, for example: https://huggingface.co/datasets/tau/sled/viewer/gov_report/train You can see how some of the other datasets in our repo (other than …) are set up. Best,
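For reference, judging by the conversion script later in this thread, a record in that format has roughly these fields (the values below are invented for illustration):

```python
# One tau/sled-style example (illustrative values only):
example = {
    'id': 'gov_report_0',                     # example id
    'pid': '0',                               # passage id
    'input_prefix': 'Summarize the report.',  # task instruction
    'input': 'Full report text ...',          # long input document
    'output': 'Reference summary ...',        # target output
}
```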
You mean that if I want to use Unlimiformer, I have to put my datasets on Hugging Face first? I can't load datasets into the code from a local path, right?
You don't necessarily need to upload it to the Hugging Face Hub, but you do need to convert it to the Hugging Face format, and then just save it to disk and load it from there.
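A minimal sketch of that save/load round trip, assuming your examples are already dicts in the expected shape (the paths and values here are hypothetical):

```python
import datasets
from datasets import load_from_disk

# Build a Dataset from in-memory examples in the expected format.
examples = [{'input': 'a long document ...', 'output': 'its summary ...'}]
ds = datasets.Dataset.from_list(examples)

# Save it in the Hugging Face on-disk format; no hub upload needed ...
ds.save_to_disk('my_dataset/train')

# ... and load it back from disk later.
ds = load_from_disk('my_dataset/train')
print(ds[0]['input'])
```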
BTW, my own datasets are all txt files.
I don't know, you can try.
I found this way on the Internet, and the resulting type is `Dataset`, so I think that should be OK, shouldn't it?
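One common way to get such a `Dataset` from plain .txt files (a sketch, with hypothetical file names) is the built-in `text` loader:

```python
from datasets import load_dataset

# Each line of the txt file becomes one example with a single 'text' field.
dataset = load_dataset('text', data_files={'train': 'train.txt'})
print(dataset['train'][0])  # e.g. {'text': '...first line of train.txt...'}
```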
Because I don't really understand this code :( there are some parameters I have no idea about:

```python
from argparse import ArgumentParser

import datasets
from datasets import load_dataset
from tqdm import tqdm


def save_dataset(examples, output_dir, split_name, hub_name=None):
    subset_dataset = datasets.Dataset.from_list(examples, split=split_name)
    if output_dir is not None:
        subset_dataset.save_to_disk(f'{output_dir}/{split_name}')
        print(f'Saved {len(subset_dataset)} {split_name} examples to disk')
    if hub_name is not None:
        subset_dataset.push_to_hub(hub_name)
        print(f'Pushed {len(subset_dataset)} {split_name} examples to {hub_name}')
    return subset_dataset


if __name__ == '__main__':
    parser = ArgumentParser()
    parser.add_argument('--input_dataset', required=True, help='dir')
    parser.add_argument('--output_dir', required=True, help='dir')
    parser.add_argument('--hub_name', required=False)
    args = parser.parse_args()

    dataset = load_dataset('tau/sled', args.input_dataset)
    for split in dataset:
        subset = dataset[split]
        new_subset = []
        for example in tqdm(subset, total=len(subset)):
            new_example = {
                'id': example['id'],
                'pid': example['pid'],
                'input': f"Q: {example['input_prefix']}\nText: {example['input']}",
                'output': example['output'] if example['output'] is not None else '',
            }
            new_subset.append(new_example)
        save_dataset(new_subset, output_dir=args.output_dir,
                     split_name=split, hub_name=args.hub_name)
```

For example, I just have 'text', so should I write `example['text']` and drop the other fields, such as 'id', 'pid', etc.? How should I rewrite `save_dataset`?

You will have to create a dataset that has `input` and `output` fields, in order to use our code as is.
So I can't just upload my dataset to Hugging Face directly?
Hi @Lavi11C! You have a field called 'text' in your dataset, right? You can use the code Uri posted to save a version of the dataset that renames that field to 'input'. Then you can run the Unlimiformer evaluation directly using …. Alternatively, you can use code like …
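For instance, here is a hedged sketch of that adaptation, assuming plain .txt source files (the file names and output directory are hypothetical) and no reference outputs:

```python
from datasets import Dataset, load_dataset
from tqdm import tqdm

# Load the raw text files; each line becomes an example with a 'text' field.
raw = load_dataset('text', data_files={'train': 'train.txt'})

for split in raw:
    new_subset = []
    for example in tqdm(raw[split], total=len(raw[split])):
        new_subset.append({
            'input': example['text'],  # rename 'text' -> 'input'
            'output': '',              # assumption: no gold outputs available
        })
    # Same idea as save_dataset() in the script above: save the converted split.
    Dataset.from_list(new_subset, split=split).save_to_disk(f'converted/{split}')
```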
Sorry, my fault. I don't have a field called 'text' in my dataset; I meant that my dataset's type is text (plain .txt files). I will add an 'input' field when I put my dataset on Hugging Face. I found this code via ChatGPT. Do you think it would work under your requirements?
I can run inference-example.py, but when I try to combine it with my own code, I run into this problem: "too many indices for tensor of dimension 1". I guess in inference-example the dataset is just one tensor, and my own dataset (built from PDFs) should be one tensor as well, but I think I'm overlooking something. Would you help me fix this problem? Thank you very much.
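For context, that error message comes from indexing a 1-D PyTorch tensor with more indices than it has dimensions, often because a batch dimension is missing. A general illustration (not the exact Unlimiformer code path):

```python
import torch

input_ids = torch.tensor([101, 2023, 2003, 102])  # shape (seq_len,): no batch dim

# Indexing as if there were a batch dimension raises
# "IndexError: too many indices for tensor of dimension 1":
# first_token = input_ids[0, 0]

# A common fix is to add the batch dimension the model code expects:
input_ids = input_ids.unsqueeze(0)  # shape (1, seq_len)
first_token = input_ids[0, 0]       # now valid
print(first_token)                  # tensor(101)
```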