Add Transformers.js as a backend for image classification #289
base: main
Conversation
Pull request overview
This pull request adds Transformers.js as a new backend for image classification in ml5.js, introducing a Vision Transformer model alongside the existing TensorFlow.js-based models. The implementation creates a new ImageClassifierTransformer class that wraps Hugging Face's Transformers.js library and integrates it into the existing imageClassifier API through a factory pattern.
Key changes:
- Added @huggingface/transformers dependency (v3.7.6) with associated dependencies for ONNX runtime and image processing
- Implemented ImageClassifierTransformer class with support for WebGPU/WASM inference
- Created three example sketches demonstrating single image classification, top-k results, and webcam classification
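For orientation, the factory routing described above might look roughly like the following sketch (illustrative only, not the PR's actual code; `createLegacyClassifier` is a hypothetical stand-in for the existing TensorFlow.js path in `src/ImageClassifier/index.js`):

```js
// Sketch of the described routing (illustrative; see src/ImageClassifier/index.js
// for the real implementation in this PR).
import ImageClassifierTransformer from "./transformer";

const imageClassifier = (modelName, options = {}, callback) => {
  if (modelName === "VisionTransformer") {
    // New path added by this PR: wrap Hugging Face's Transformers.js.
    return new ImageClassifierTransformer(options, callback);
  }
  // Every other model name falls through to the existing TF.js-based path.
  return createLegacyClassifier(modelName, options, callback); // hypothetical helper
};

export default imageClassifier;
```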
Reviewed changes
Copilot reviewed 11 out of 14 changed files in this pull request and generated 11 comments.
| File | Description |
|---|---|
| package.json | Added @huggingface/transformers v3.7.6 dependency |
| yarn.lock | Added all transitive dependencies for Transformers.js including ONNX runtime, Sharp, and Protobuf packages |
| webpack.config.js | Added warning suppression for ESM import.meta usage in Transformers.js |
| src/utils/imageUtilities.js | Exported existing drawToCanvas helper function for video-to-canvas conversion |
| src/ImageClassifier/transformer.js | New implementation of Vision Transformer-based image classifier |
| src/ImageClassifier/index.js | Modified factory function to route "VisionTransformer" model name to new implementation |
| examples/imageClassifier-transformer-single-image/* | Example demonstrating single image classification with default top-k |
| examples/imageClassifier-transformer-single-image-topk/* | Example demonstrating custom top-k parameter usage |
| examples/imageClassifier-transformer-webcam/* | Example demonstrating continuous webcam classification with start/stop |
This might be just for testing - thought I'd point this out just in case: I don't think it's worth having a separate example just to demonstrate the topk option (documenting it seems sufficient, imho).
```js
let confidence = "";

function preload() {
  classifier = ml5.imageClassifier("VisionTransformer", { topK: 2 });
```
Elsewhere in the codebase, we use topk as an option with a lowercase k. Probably good to retain that for consistency?
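If both spellings should keep working during a transition, normalizing once up front would do it (a sketch; the default value of 5 is an assumption):

```js
// Accept both spellings but standardize on lowercase "topk" internally.
// (Illustrative sketch; the default of 5 is an assumption.)
const topk = options.topk ?? options.topK ?? 5;
```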
```
Learn more about the ml5.js project: https://ml5js.org/
ml5.js license and Code of Conduct: https://github.com/ml5js/ml5-next-gen/blob/main/LICENSE.md

This example demonstrates detecting objects in a live video through ml5.imageClassifier.
```
add "using a transformer model" (here and in <title>, sketch.js)?
```html
<script src="sketch.js"></script>
</body>
</html>

```
nitpick: extra newline
```html
<script src="sketch.js"></script>
</body>
</html>

```
nitpick: extra newline
```js
};

export default imageClassifier;
```
Git prefers to have a newline character at the end of each file (since diff operates on whole lines)
```js
// WebGPU is very fast, so we can call the next frame immediately
if (this.device === "webgpu") next();
// Wasm is slower, so we wait for 1 second before calling the next frame
else setTimeout(next, 1000);
```
Does this limit WASM to 1 fps (regardless of compute power)? Is there any way to schedule this dynamically instead?
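One possible dynamic schedule (a sketch, not from this PR: it measures each inference and derives the idle time from that, so faster machines are not pinned to 1 fps; `classifyFrame` is a hypothetical per-frame helper):

```js
// Sketch: size the delay to the measured inference cost instead of a fixed
// 1-second wait. (Illustrative; classifyFrame is a hypothetical helper.)
const start = performance.now();
const results = await this.classifyFrame();
const elapsed = performance.now() - start;

// Idle for roughly one inference-worth of time between frames, capped so
// even very slow WASM machines still update at least every 2 seconds.
setTimeout(next, Math.min(elapsed, 2000));
```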
```js
let confidence = "";

function preload() {
  classifier = ml5.imageClassifier("VisionTransformer");
```
Personally, I'm a bit on the fence about whether "VisionTransformer" is beneficial vs. "vit-base-patch16-224" ... searching for the former brings up articles about the general architecture (which by now many different models implement); only the latter tells me that this model was, e.g., trained on 14 million images with 21 thousand classes and uses a resolution of 224x224.
If we'll be using "VisionTransformer": how about printing the actual name of the model being used to the console?
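That could be as small as a lookup plus a log line (a sketch; the alias table and checkpoint id below are illustrative assumptions):

```js
// Sketch: resolve the friendly alias to a concrete checkpoint and say so.
// (Alias table and checkpoint id are illustrative assumptions.)
const MODEL_ALIASES = {
  VisionTransformer: "Xenova/vit-base-patch16-224",
};

const modelId = MODEL_ALIASES[modelName] ?? modelName;
console.log(`ml5.imageClassifier: loading "${modelName}" (${modelId})`);
```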
Thank you @pearmini - added a few comments to specific lines. I had two general observations:

1. In China, using NYU Shanghai's otherwise excellent VPN, loading the 345 MB model took me 3.3 minutes - significantly worse than with Google's TensorFlow.js models. (Most users will have given up well before that point.) Curious whether this is a known issue (throttling?) in our corner of the world, or whether there are other CDNs in front of Hugging Face we might want to use?

2. In my Chrome, I got a warning message and two error messages in the console while it ran. Ideally we wouldn't print those, since people using the library might think they are their fault, or something they need to address. Is there any way to reduce transformers.js verbosity, or otherwise filter its output?
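If ml5 exposes (or the sketch imports) the Transformers.js `env` object, both points might be addressable along these lines (a hedged sketch; the mirror host and the log-level plumbing are assumptions worth verifying):

```js
import { env } from "@huggingface/transformers";

// Point model downloads at an alternative host (e.g. a regional mirror)
// instead of huggingface.co. The mirror URL here is an assumption to verify.
env.remoteHost = "https://hf-mirror.com";

// Transformers.js exposes the onnxruntime-web environment under
// env.backends.onnx; raising its log level should quiet the warnings,
// though whether it catches all of these messages is worth testing.
env.backends.onnx.logLevel = "fatal";
```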
These console warnings seem to be an ongoing problem with onnxruntime itself; here's a GitHub issue on the matter: huggingface/transformers.js#270


For single image:
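A minimal sketch based on this PR's single-image example (the asset path is hypothetical, and the callback shape assumes ml5's usual `results` array):

```js
let classifier;
let img;

function preload() {
  classifier = ml5.imageClassifier("VisionTransformer");
  img = loadImage("images/bird.jpg"); // hypothetical asset
}

function setup() {
  createCanvas(400, 400);
  image(img, 0, 0, width, height);
  classifier.classify(img, gotResult);
}

function gotResult(results) {
  // results: array of { label, confidence } objects, best match first
  console.log(results);
}
```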
For webcam:
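A sketch of continuous classification, assuming the `classifyStart`/`classifyStop` API used in this PR's webcam example:

```js
let classifier;
let video;
let label = "";

function preload() {
  classifier = ml5.imageClassifier("VisionTransformer");
}

function setup() {
  createCanvas(640, 480);
  video = createCapture(VIDEO);
  video.hide();
  classifier.classifyStart(video, gotResult); // stop with classifier.classifyStop()
}

function draw() {
  image(video, 0, 0);
  text(label, 10, height - 10);
}

function gotResult(results) {
  label = results[0].label;
}
```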
Specify options:
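Options go in the second argument; the spelling below follows this PR (see the `topK` vs. `topk` discussion above):

```js
// Request only the top 2 predictions.
classifier = ml5.imageClassifier("VisionTransformer", { topK: 2 });
```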
TODO