10 Jan 17:50

5b5aa4c

2.14.0

What's new?

🚀 Segment Anything Model (SAM)

The Segment Anything Model (SAM) can be used to generate segmentation masks for objects in a scene, given an input image and input points. See here for the full list of pre-converted models. Support for this model was added in #510.

Demo + source code: https://huggingface.co/spaces/Xenova/segment-anything-web

Example: Perform mask generation w/ Xenova/slimsam-77-uniform.

import { SamModel, AutoProcessor, RawImage } from '@xenova/transformers';

const model = await SamModel.from_pretrained('Xenova/slimsam-77-uniform');
const processor = await AutoProcessor.from_pretrained('Xenova/slimsam-77-uniform');

const img_url = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/corgi.jpg';
const raw_image = await RawImage.read(img_url);
const input_points = [[[340, 250]]]; // 2D location of the input point on the subject

const inputs = await processor(raw_image, input_points);
const outputs = await model(inputs);

const masks = await processor.post_process_masks(outputs.pred_masks, inputs.original_sizes, inputs.reshaped_input_sizes);
console.log(masks);
// [
//   Tensor {
//     dims: [ 1, 3, 410, 614 ],
//     type: 'bool',
//     data: Uint8Array(755220) [ ... ],
//     size: 755220
//   }
// ]
const scores = outputs.iou_scores;
console.log(scores);
// Tensor {
//   dims: [ 1, 1, 3 ],
//   type: 'float32',
//   data: Float32Array(3) [
//     0.8350210189819336,
//     0.9786665439605713,
//     0.8379436731338501
//   ],
//   size: 3
// }

You can then visualize the 3 predicted masks with:

const image = RawImage.fromTensor(masks[0][0].mul(255));
image.save('mask.png');

[Images: input image | visualized output]

Next, select the channel with the highest IoU score, which in this case is the second (green) channel. Intersecting this with the original image gives us an isolated version of the subject:

[Images: selected mask | intersected with the original image]
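
The selection and intersection steps described above can be sketched on plain typed arrays. This is only an illustration: `argmax` and `intersect` are hypothetical helpers, not part of Transformers.js, and the sketch assumes the candidate masks are stored back to back in row-major order (which matches the `[1, 3, 410, 614]` dims and 755220 values shown above) and that the image pixels are interleaved per channel.

```javascript
// Hypothetical helpers (not part of Transformers.js), working on plain arrays.

// Index of the largest value, e.g. the mask with the highest IoU score.
function argmax(values) {
  let best = 0;
  for (let i = 1; i < values.length; ++i) {
    if (values[i] > values[best]) best = i;
  }
  return best;
}

// Keep the pixels where `mask` is non-zero and zero out everything else.
// `pixels` is interleaved (e.g. RGB), `mask` holds one value per pixel.
function intersect(pixels, mask, channels) {
  const out = new Uint8ClampedArray(pixels.length);
  for (let p = 0; p < mask.length; ++p) {
    if (!mask[p]) continue;
    for (let c = 0; c < channels; ++c) {
      out[p * channels + c] = pixels[p * channels + c];
    }
  }
  return out;
}

// IoU scores copied from the example output above.
const iouScores = [0.8350210189819336, 0.9786665439605713, 0.8379436731338501];
const best = argmax(iouScores);
console.log(best); // 1 (the second mask)

// Toy stand-ins: a 2x1 RGB image and three candidate masks of 2 pixels each.
const numPixels = 2;
const pixels = Uint8ClampedArray.from([10, 20, 30, 40, 50, 60]);
const maskData = Uint8Array.from([1, 1, 0, 1, 1, 0]);

// Slice out the best mask and apply it to the image.
const bestMask = maskData.subarray(best * numPixels, (best + 1) * numPixels);
const isolated = intersect(pixels, bestMask, 3);
console.log(Array.from(isolated)); // [0, 0, 0, 40, 50, 60]
```

With real outputs, `outputs.iou_scores.data` and `masks[0].data` would take the place of `iouScores` and `maskData`, assuming the layout described above holds.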

🛠️ Improvements

Add support for processing non-square images w/ ConvNextFeatureExtractor in #503
Encode revision in remote URL in #507

Full Changelog: 2.13.4...2.14.0

04 Jan 17:31

xenova

2.13.4

07df34f

What's new?

Add support for cross-encoder models (+fix token type ids) (#501)

Example: Information Retrieval w/ Xenova/ms-marco-TinyBERT-L-2-v2.

import { AutoTokenizer, AutoModelForSequenceClassification } from '@xenova/transformers';

const model = await AutoModelForSequenceClassification.from_pretrained('Xenova/ms-marco-TinyBERT-L-2-v2');
const tokenizer = await AutoTokenizer.from_pretrained('Xenova/ms-marco-TinyBERT-L-2-v2');

const features = tokenizer(
    ['How many people live in Berlin?', 'How many people live in Berlin?'],
    {
        text_pair: [
            'Berlin has a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.',
            'New York City is famous for the Metropolitan Museum of Art.',
        ],
        padding: true,
        truncation: true,
    }
)

const { logits } = await model(features)
console.log(logits.data);
// quantized:   [ 7.210887908935547, -11.559350967407227 ]
// unquantized: [ 7.235750675201416, -11.562294006347656 ]
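
For retrieval, the logits are typically used to order the candidate passages. A minimal sketch, assuming one logit per (query, passage) pair as in the output above: `rankPassages` is a hypothetical helper, and the sigmoid is an optional squashing to (0, 1) that leaves the order unchanged.

```javascript
// Hypothetical helper: pair each passage with its score and sort, best first.
function rankPassages(passages, scores) {
  return passages
    .map((text, i) => ({ text, score: scores[i] }))
    .sort((a, b) => b.score - a.score);
}

// Optional: map a raw logit to (0, 1). Monotonic, so the ranking is unchanged.
const sigmoid = (x) => 1 / (1 + Math.exp(-x));

const passages = [
  'Berlin has a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.',
  'New York City is famous for the Metropolitan Museum of Art.',
];

// Quantized logits copied from the example output above.
const scores = [7.210887908935547, -11.559350967407227];

const ranked = rankPassages(passages, scores);
for (const { text, score } of ranked) {
  console.log(sigmoid(score).toFixed(4), text);
}
```

In the example above, `logits.data` would take the place of `scores`.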

Check out the list of pre-converted models here. We also put out a demo for you to try out.

Full Changelog: 2.13.3...2.13.4

04 Jan 00:41

xenova

2.13.3

f3482ba

What's new?

Fix typo in JSDoc in #498
Fix properties on pipelines in #500. Thanks to @wesbos for reporting the issue!

Full Changelog: 2.13.2...2.13.3

03 Jan 14:57

xenova

2.13.2

733f982

What's new?

This release is a follow-up to #485, with additional IntelliSense-focused improvements (see PR).

Full Changelog: 2.13.1...2.13.2

03 Jan 11:24

xenova

2.13.1

e8d1236

What's new?

Improve typing of pipeline function in #485. Thanks to @wesbos for the suggestion!

This also means when you hover over the class name, you'll get example code to help you out.

Add phi-1_5 model in #493.

Example: Text generation w/ Xenova/phi-1_5_dev.

import { pipeline } from '@xenova/transformers';

// Create a text-generation pipeline
const generator = await pipeline('text-generation', 'Xenova/phi-1_5_dev');

// Construct prompt
const prompt = `\`\`\`py
import math
def print_prime(n):
    """
    Print all primes between 1 and n
    """`;

// Generate text
const result = await generator(prompt, {
  max_new_tokens: 100,
});
console.log(result[0].generated_text);

Results in:

import math
def print_prime(n):
    """
    Print all primes between 1 and n
    """
    primes = []
    for num in range(2, n+1):
        is_prime = True
        for i in range(2, int(math.sqrt(num))+1):
            if num % i == 0:
                is_prime = False
                break
        if is_prime:
            primes.append(num)
    print(primes)

print_prime(20)

Running the code produces the correct result:

[2, 3, 5, 7, 11, 13, 17, 19]

Full Changelog: 2.13.0...2.13.1

27 Dec 15:00

xenova

2.13.0

61459e3

What's new?

🎄 7 new architectures!

This release adds support for many new multimodal architectures, bringing the total number of supported architectures to 80! 🤯

1. VITS for multilingual text-to-speech in over 1000 languages! (#466)

import { pipeline } from '@xenova/transformers';

// Create English text-to-speech pipeline
const synthesizer = await pipeline('text-to-speech', 'Xenova/mms-tts-eng');

// Generate speech
const output = await synthesizer('I love transformers');
// {
//   audio: Float32Array(26112) [...],
//   sampling_rate: 16000
// }

[Audio sample: mms-tts-eng.mp4]

See here for the list of available models. To start, we've converted 12 of the ~1140 models on the Hugging Face Hub. If we haven't added the one you wish to use, you can make it web-ready using our conversion script.
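
The text-to-speech output shown above is raw samples plus a sampling rate (26112 samples at 16000 Hz, about 1.63 seconds). One way to make this playable is to wrap it in a WAV container. The sketch below is a hypothetical helper, not part of Transformers.js, and assumes mono audio with float samples in the range [-1, 1]; a synthetic tone stands in for `output.audio`.

```javascript
// Hypothetical helper: encode mono float samples in [-1, 1] as 16-bit PCM WAV bytes.
function encodeWav(samples, sampleRate) {
  const bytesPerSample = 2;
  const dataSize = samples.length * bytesPerSample;
  const buffer = new ArrayBuffer(44 + dataSize);
  const view = new DataView(buffer);
  const writeString = (offset, s) => {
    for (let i = 0; i < s.length; ++i) view.setUint8(offset + i, s.charCodeAt(i));
  };

  writeString(0, 'RIFF');
  view.setUint32(4, 36 + dataSize, true);                // file size minus the first 8 bytes
  writeString(8, 'WAVE');
  writeString(12, 'fmt ');
  view.setUint32(16, 16, true);                          // size of the fmt chunk
  view.setUint16(20, 1, true);                           // audio format: PCM
  view.setUint16(22, 1, true);                           // channels: mono
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * bytesPerSample, true); // byte rate
  view.setUint16(32, bytesPerSample, true);              // block align
  view.setUint16(34, 16, true);                          // bits per sample
  writeString(36, 'data');
  view.setUint32(40, dataSize, true);

  for (let i = 0; i < samples.length; ++i) {
    const s = Math.max(-1, Math.min(1, samples[i]));     // clamp to [-1, 1]
    view.setInt16(44 + i * 2, Math.round(s * 32767), true);
  }
  return new Uint8Array(buffer);
}

// Stand-in for `output.audio`: one second of a 440 Hz tone at 16 kHz.
const sampling_rate = 16000;
const audio = Float32Array.from(
  { length: sampling_rate },
  (_, i) => Math.sin((2 * Math.PI * 440 * i) / sampling_rate),
);

const wav = encodeWav(audio, sampling_rate);
console.log(wav.length); // 32044 (44-byte header + 16000 samples * 2 bytes)
```

In Node.js, the resulting bytes could then be written to disk with `fs.writeFileSync('out.wav', wav)`.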

2. CLIPSeg for zero-shot image segmentation. (#478)

import { AutoTokenizer, AutoProcessor, CLIPSegForImageSegmentation, RawImage } from '@xenova/transformers';

// Load tokenizer, processor, and model
const tokenizer = await AutoTokenizer.from_pretrained('Xenova/clipseg-rd64-refined');
const processor = await AutoProcessor.from_pretrained('Xenova/clipseg-rd64-refined');
const model = await CLIPSegForImageSegmentation.from_pretrained('Xenova/clipseg-rd64-refined');

// Run tokenization
const texts = ['a glass', 'something to fill', 'wood', 'a jar'];
const text_inputs = tokenizer(texts, { padding: true, truncation: true });

// Read image and run processor
const image = await RawImage.read('https://github.com/timojl/clipseg/blob/master/example_image.jpg?raw=true');
const image_inputs = await processor(image);

// Run model with both text and pixel inputs
const { logits } = await model({ ...text_inputs, ...image_inputs });
// logits: Tensor {
//   dims: [4, 352, 352],
//   type: 'float32',
//   data: Float32Array(495616)[ ... ],
//   size: 495616
// }

You can visualize the predictions as follows:

const preds = logits
  .unsqueeze_(1)
  .sigmoid_()
  .mul_(255)
  .round_()
  .to('uint8');

for (let i = 0; i < preds.dims[0]; ++i) {
  const img = RawImage.fromTensor(preds[i]);
  img.save(`prediction_${i}.png`);
}

[Images: original | "a glass" | "something to fill" | "wood" | "a jar"]
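
To make the visualization chain above concrete, here is roughly what it computes per value: a sigmoid turns each logit into a probability, which is then scaled to 0..255 and rounded into a grayscale pixel. `logitsToGrayscale` is a hypothetical plain-array helper written for illustration, not the library's implementation.

```javascript
// Hypothetical helper: sigmoid -> scale to 0..255 -> round, on a plain typed array.
function logitsToGrayscale(logits) {
  const out = new Uint8Array(logits.length);
  for (let i = 0; i < logits.length; ++i) {
    const probability = 1 / (1 + Math.exp(-logits[i]));
    out[i] = Math.round(probability * 255);
  }
  return out;
}

// Strongly negative, neutral, and strongly positive logits.
const gray = logitsToGrayscale(Float32Array.from([-20, 0, 20]));
console.log(Array.from(gray)); // [0, 128, 255]
```

Bright pixels therefore mark regions the model considers a good match for the text prompt, and dark pixels mark poor matches.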