[FEATURE] basic use of pipeline to generate SFT dataset from documents #1076

burtenshaw · 2024-12-02T12:14:31Z

This is a continuation of this: #1059

It implements a pipeline template that runs a SelfInstruct step followed by text generation over a dataset of documents. This should help new users bootstrap SFT datasets.

from datasets import Dataset
import wikipedia
from distilabel.pipeline import DatasetInstructionResponsePipeline

pipeline = DatasetInstructionResponsePipeline(num_instructions=5)

distiset = pipeline.run(
    use_cache=False,
    dataset=Dataset.from_list(
        [
            {
                "input": wikipedia.page(title="Transfer_learning").content,
            }
        ]
    ),
)
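As a rough illustration of the two-stage flow described above (instructions generated per document, then a response generated per instruction), here is a minimal sketch in plain Python. It does not use distilabel: the helper names `fake_instruction_generator`, `fake_responder`, and `build_sft_rows` are hypothetical, the two "model" functions are stubs standing in for LLM calls, and the output column names `instruction` and `response` are assumptions. Only the `input` column and `num_instructions` come from the example above.

```python
from typing import Callable, Dict, List


def fake_instruction_generator(document: str, num_instructions: int) -> List[str]:
    """Hypothetical stub for the SelfInstruct-style step: one placeholder per instruction."""
    topic = document.split(".")[0][:40]
    return [f"Question {i + 1} about: {topic}" for i in range(num_instructions)]


def fake_responder(instruction: str) -> str:
    """Hypothetical stub for the text generation step."""
    return f"Answer to '{instruction}'"


def build_sft_rows(
    documents: List[Dict[str, str]],
    num_instructions: int,
    generate_instructions: Callable[[str, int], List[str]] = fake_instruction_generator,
    respond: Callable[[str], str] = fake_responder,
) -> List[Dict[str, str]]:
    """Expand each document into `num_instructions` instruction/response rows."""
    rows: List[Dict[str, str]] = []
    for doc in documents:
        # Stage 1: derive instructions from the document text.
        for instruction in generate_instructions(doc["input"], num_instructions):
            # Stage 2: generate a response for each instruction.
            rows.append(
                {
                    "input": doc["input"],
                    "instruction": instruction,
                    "response": respond(instruction),
                }
            )
    return rows


rows = build_sft_rows(
    [{"input": "Transfer learning reuses knowledge. It is widely used."}],
    num_instructions=5,
)
```

With one input document and `num_instructions=5`, the sketch yields five rows; swapping the two stub callables for real model calls would keep the same shape.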

review-notebook-app · 2024-12-02T12:14:38Z

Check out this pull request on ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.

Powered by ReviewNB


github-actions · 2024-12-02T12:15:49Z

Documentation for this PR has been built. You can view it at: https://distilabel.argilla.io/pr-1076/

codspeed-hq · 2024-12-02T12:19:03Z

CodSpeed Performance Report

Merging #1076 will not alter performance

Comparing feat/dataset-instruction-response-pipeline (f76bc38) with develop (a8588fd)

Summary

✅ 1 untouched benchmarks

davidberenstein1957 · 2024-12-10T10:05:21Z

@burtenshaw can we get rid of the pipeline.pipeline.run? Also, perhaps we could limit the exposure to different classes with something like the following. Under the hood it can still use the same implementation; we would just pass different arguments. WDYT?

from datasets import Dataset
import wikipedia
from distilabel.pipeline import InstructionResponsePipeline

pipeline = InstructionResponsePipeline(num_instructions=5)

distiset = pipeline.pipeline.run(
    use_cache=False,
    dataset=Dataset.from_list(
        [
            {
                "input": wikipedia.page(title="Transfer_learning").content,
            }
        ]
    ),
)
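One way the `pipeline.pipeline.run` call could be avoided is for the template class to expose its own `run` method that delegates to the wrapped pipeline. The following is only a sketch under stated assumptions: `_InnerPipeline` and `InstructionResponseTemplate` are hypothetical names, the inner class is a stub rather than distilabel's real pipeline, and nothing here reflects how the PR actually resolves the request.

```python
from typing import Any, Dict, List


class _InnerPipeline:
    """Hypothetical stand-in for the wrapped pipeline object (not distilabel's class)."""

    def run(self, dataset: List[Dict[str, Any]], use_cache: bool = True) -> Dict[str, Any]:
        # A real pipeline would execute its steps; the stub just echoes its inputs.
        return {"num_rows": len(dataset), "use_cache": use_cache}


class InstructionResponseTemplate:
    """Hypothetical template that exposes run() directly."""

    def __init__(self, num_instructions: int = 5) -> None:
        self.num_instructions = num_instructions
        self.pipeline = _InnerPipeline()

    def run(self, **kwargs: Any) -> Dict[str, Any]:
        # Delegate so callers write template.run(...) instead of
        # template.pipeline.run(...).
        return self.pipeline.run(**kwargs)


template = InstructionResponseTemplate(num_instructions=5)
result = template.run(dataset=[{"input": "some document text"}], use_cache=False)
```

Delegation keeps the inner pipeline reachable for advanced use while giving callers a single entry point.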

src/distilabel/pipeline/templates/dataset_instruction.py


burtenshaw added 2 commits November 25, 2024 22:23

feat: implement abstraction on pipeline form datasets

ab8c385

docs: update class doc string and examples

81697ca

[pre-commit.ci] auto fixes from pre-commit.com hooks

ff18c78

for more information, see https://pre-commit.ci

burtenshaw requested review from gabrielmbmb and plaguss December 2, 2024 12:14

burtenshaw marked this pull request as draft December 2, 2024 14:16

burtenshaw requested a review from davidberenstein1957 December 10, 2024 09:56

davidberenstein1957 reviewed Dec 10, 2024


burtenshaw and others added 2 commits December 16, 2024 12:36

feat: respond to small changes

3266e70

[pre-commit.ci] auto fixes from pre-commit.com hooks

d8f3310

for more information, see https://pre-commit.ci

davidberenstein1957 reviewed Dec 16, 2024


src/distilabel/pipeline/templates/dataset_instruction.py

burtenshaw and others added 5 commits December 16, 2024 13:23

add kwargs to docstring

45e10f1

Merge branch 'feat/dataset-instruction-response-pipeline' of https://…

6e69361

…github.com/argilla-io/distilabel into feat/dataset-instruction-response-pipeline

[pre-commit.ci] auto fixes from pre-commit.com hooks

68524f5

for more information, see https://pre-commit.ci

remove notebook

a2b7356

Merge branch 'feat/dataset-instruction-response-pipeline' of https://…

f76bc38

…github.com/argilla-io/distilabel into feat/dataset-instruction-response-pipeline

burtenshaw marked this pull request as ready for review December 16, 2024 12:26
