
Conversation

@pearmini
Member

@pearmini pearmini commented Nov 13, 2025

For a single image:

let classifier;
let img;

function preload() {
  classifier = ml5.imageClassifier("VisionTransformer");
  img = loadImage("images/bird.jpg");
}

function setup() {
  createCanvas(400, 400);
  classifier.classify(img, gotResult);
  image(img, 0, 0, width, height);
}
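
These snippets assume a gotResult callback; a minimal sketch, assuming results arrive as the usual ml5.js array of { label, confidence } objects:

function gotResult(results) {
  // results is sorted by confidence, e.g. [{ label: "robin", confidence: 0.92 }, ...]
  console.log(results);
}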

For webcam:

let classifier;
let video;

function preload() {
  classifier = ml5.imageClassifier("VisionTransformer");
}

function setup() {
  createCanvas(640, 480);
  video = createCapture(VIDEO);
  video.size(640, 480);
  video.hide();
  classifier.classifyStart(video, gotResult);
}
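
To stop the continuous loop, a sketch assuming a classifyStop counterpart exists (the webcam example in this PR is described as demonstrating start/stop):

let isClassifying = true;

function mousePressed() {
  // Toggle continuous classification on click
  if (isClassifying) classifier.classifyStop();
  else classifier.classifyStart(video, gotResult);
  isClassifying = !isClassifying;
}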

Specify options:

function preload() {
  classifier = ml5.imageClassifier("VisionTransformer", { dtype: "fp32", device: "wasm", topK: 5 });
}
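
The same options annotated; the value meanings below are assumptions based on the Transformers.js API rather than anything documented in this PR:

classifier = ml5.imageClassifier("VisionTransformer", {
  dtype: "fp32",   // weight precision/quantization, e.g. "fp32", "fp16", "q8" (assumed Transformers.js values)
  device: "wasm",  // inference backend: "webgpu" where available, "wasm" as a fallback
  topK: 5,         // how many top labels are passed to the result callback
});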

TODO

  • Classify single image
  • Handle Video Input
  • Add more examples
  • Support top_k input
  • Docs

@pearmini pearmini marked this pull request as draft November 13, 2025 13:12
@pearmini pearmini force-pushed the image-classifier-transformer-js branch from 791087b to 605d8ee on November 20, 2025 13:04
@pearmini pearmini force-pushed the image-classifier-transformer-js branch 2 times, most recently from 40bdcb9 to e7dea85 on December 3, 2025 19:36
@pearmini pearmini force-pushed the image-classifier-transformer-js branch from e7dea85 to 7fabf65 on December 3, 2025 19:40
@pearmini pearmini force-pushed the image-classifier-transformer-js branch from 34f9ae0 to 0eccced on December 4, 2025 03:05
@pearmini pearmini requested a review from Copilot December 4, 2025 03:15
@pearmini pearmini marked this pull request as ready for review December 4, 2025 03:15
@pearmini pearmini requested review from gohai and shiffman December 4, 2025 03:18

Copilot AI left a comment

Pull request overview

This pull request adds Transformers.js as a new backend for image classification in ml5.js, introducing a Vision Transformer model alongside the existing TensorFlow.js-based models. The implementation creates a new ImageClassifierTransformer class that wraps Hugging Face's Transformers.js library and integrates it into the existing imageClassifier API through a factory pattern.

Key changes:

  • Added @huggingface/transformers dependency (v3.7.6) with associated dependencies for ONNX runtime and image processing
  • Implemented ImageClassifierTransformer class with support for WebGPU/WASM inference
  • Created three example sketches demonstrating single image classification, top-k results, and webcam classification
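
The factory routing described in the overview might look roughly like the following sketch; the class and file names come from the file summary below, but the internal logic here is an assumption, not the PR's actual code:

// A hypothetical sketch of the routing in src/ImageClassifier/index.js;
// the real signatures and option handling may differ.
import ImageClassifier from "./ImageClassifier";         // existing TF.js-based class (name assumed)
import ImageClassifierTransformer from "./transformer";  // new Transformers.js-based class

const imageClassifier = (modelNameOrUrl, options = {}, callback) => {
  if (modelNameOrUrl === "VisionTransformer") {
    // Route the new model name to the Transformers.js implementation
    return new ImageClassifierTransformer(options, callback);
  }
  // Everything else keeps using the existing TensorFlow.js models
  return new ImageClassifier(modelNameOrUrl, options, callback);
};

export default imageClassifier;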

Reviewed changes

Copilot reviewed 11 out of 14 changed files in this pull request and generated 11 comments.

Summary per file:

  • package.json: Added @huggingface/transformers v3.7.6 dependency
  • yarn.lock: Added all transitive dependencies for Transformers.js, including ONNX runtime, Sharp, and Protobuf packages
  • webpack.config.js: Added warning suppression for ESM import.meta usage in Transformers.js
  • src/utils/imageUtilities.js: Exported existing drawToCanvas helper function for video-to-canvas conversion
  • src/ImageClassifier/transformer.js: New implementation of the Vision Transformer-based image classifier
  • src/ImageClassifier/index.js: Modified factory function to route the "VisionTransformer" model name to the new implementation
  • examples/imageClassifier-transformer-single-image/*: Example demonstrating single image classification with default top-k
  • examples/imageClassifier-transformer-single-image-topk/*: Example demonstrating custom top-k parameter usage
  • examples/imageClassifier-transformer-webcam/*: Example demonstrating continuous webcam classification with start/stop


Copilot AI commented Dec 4, 2025

@pearmini I've opened a new pull request, #293, to work on those changes. Once the pull request is ready, I'll request review from you.

pearmini and others added 2 commits December 3, 2025 23:22
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@@ -0,0 +1,45 @@
/*
Member

This might be just for testing - thought I'd point this out just in case: I don't think it's worth having a separate example just to demonstrate the topk option (documenting it seems sufficient imho)

let confidence = "";

function preload() {
classifier = ml5.imageClassifier("VisionTransformer", { topK: 2 });
Member

Elsewhere in the codebase, we use topk as an option with lowercase k. Probably good to retain that for consistency?

Learn more about the ml5.js project: https://ml5js.org/
ml5.js license and Code of Conduct: https://github.com/ml5js/ml5-next-gen/blob/main/LICENSE.md
This example demonstrates detecting objects in a live video through ml5.imageClassifier.
Member

add "using a transformer model" (here and in <title> and sketch.js)?

<script src="sketch.js"></script>
</body>
</html>

Member

nitpick: extra newline

<script src="sketch.js"></script>
</body>
</html>

Member

nitpick: extra newline

};

export default imageClassifier;
export default imageClassifier;
Member

Git prefers to have a newline character at the end of each file (since diff operates on whole lines)

// WebGPU is very fast, so we can call the next frame immediately
if (this.device === "webgpu") next();
// Wasm is slower, so we wait for 1 second before calling the next frame
else setTimeout(next, 1000);
Member

Does this limit WASM to 1 fps (regardless of compute power)? Is there any way to schedule this dynamically instead?
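
One adaptive alternative (a sketch, assuming the surrounding loop body is async and next() re-enters it): time each inference and reuse the measured duration as the delay, so throughput tracks the machine's actual speed instead of being pinned to 1 fps on WASM.

// Sketch of adaptive scheduling, replacing the fixed 1-second delay.
const start = performance.now();
const results = await this.runInference(frame); // hypothetical per-frame inference call
const elapsed = performance.now() - start;
this.callback(results);
// Wait roughly one inference-time before the next frame (16 ms floor),
// so WebGPU runs near-continuously and WASM self-throttles.
setTimeout(next, Math.max(16, elapsed));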

let confidence = "";

function preload() {
classifier = ml5.imageClassifier("VisionTransformer");
Member

Personally, I'm a bit on the fence about whether "VisionTransformer" is beneficial vs "vit-base-patch16-224" ... searching for the former brings up articles on the general architecture (which by now different models implement); only the latter tells me that this model was, e.g., trained on 14 million images with 21 thousand classes, and uses a resolution of 224x224.

If we'll be using "VisionTransformer": how about printing the actual name of the model that is being used to the console?
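
If the friendly name stays, the suggested console hint could be a single line; the checkpoint id below is an assumption about which model the PR actually loads:

console.log('ml5.imageClassifier: "VisionTransformer" resolves to the Transformers.js model "Xenova/vit-base-patch16-224"');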

@gohai
Member

gohai commented Dec 24, 2025

Thank you @pearmini - added a few comments to specific lines.

I had two general observations:

In China, using NYU Shanghai's otherwise excellent VPN, loading the 345 MB model took me 3.3 minutes - which is significantly worse than with Google's tensorflow.js models. (Most users will already have given up at this point.) Curious if this is a known issue (throttling?) in our corner of the woods, or if there are other CDNs in front of HuggingFace we might want to use?

In my Chrome, I received a warning message and two error messages in the console while it ran. Ideally, we don't print those, since people using the library might think those are their fault, or something they might need to address…? Is there any way to reduce transformers.js verbosity, or otherwise filter its output?

[Screenshot: Chrome console showing the warning and error messages]

@nasif-co
Contributor

> In my Chrome, I received a warning message and two error messages in the console while it ran. Ideally, we don't print those, since people using the library might think those are their fault, or something they might need to address…? Is there any way to reduce transformers.js verbosity, or otherwise filter its output?
>
> [Screenshot: Chrome console showing the warning and error messages]

These console warnings seem to be an ongoing problem with the ONNX runtime itself; here's a GitHub issue on the matter: huggingface/transformers.js#270
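
One possible mitigation (an untested sketch; it assumes Transformers.js exposes the ONNX Runtime environment at env.backends.onnx, and it may not silence messages emitted from the WASM side):

import { env } from "@huggingface/transformers";

// Raise the ONNX Runtime log threshold so only fatal messages print.
// This is an assumption based on workarounds discussed in the linked issue thread.
env.backends.onnx.logLevel = "fatal";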
