Hemm is a library for comprehensively benchmarking text-to-image diffusion models on image quality and prompt comprehension. It is integrated with Weave, a lightweight toolkit for tracking and evaluating LLM applications, built by Weights & Biases.
Hemm is highly inspired by the following projects:
- Holistic Evaluation of Text-To-Image Models
- T2I-CompBench: A Comprehensive Benchmark for Open-world Compositional Text-to-image Generation
- T2I-CompBench++: An Enhanced and Comprehensive Benchmark for Compositional Text-to-image Generation
- GenEval: An Object-Focused Framework for Evaluating Text-to-Image Alignment
The evaluation pipeline takes each example, passes it through your application, and scores the output with multiple custom scoring functions using Weave Evaluation. This gives you a view of your model's performance and a rich UI to drill into individual outputs and scores.
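Under the hood, a scorer passed to Weave Evaluation is simply a callable tracked with `weave.op`, whose arguments are matched by name against the dataset columns and the model output. The snippet below is a minimal, hypothetical sketch of such a custom scorer, not one of Hemm's built-in metrics; the `prompt` column is an assumption, and in some Weave versions the output parameter is named `model_output` rather than `output`.

```python
import weave

# Minimal sketch of a custom Weave scorer (illustrative, not a Hemm metric).
# Scorer arguments are matched by name: `prompt` is assumed to be a dataset
# column, and `output` receives the model's prediction for that row.
@weave.op()
def non_empty_output_scorer(prompt: str, output: dict) -> dict:
    # Toy check: did the model produce anything at all for this prompt?
    return {"has_output": output is not None, "prompt_length": len(prompt)}
```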
| Leaderboard | Weave Evals |
|---|---|
| Rendering prompts with Complex Actions | Weave Evals |
First, we recommend installing PyTorch by following the instructions at pytorch.org/get-started/locally. Then, install Hemm from source:
```bash
git clone https://github.com/wandb/Hemm
cd Hemm
pip install -e ".[core]"
```
First, you need to publish your evaluation dataset to Weave. Check out this tutorial that shows you how to publish a dataset on your project.
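If you have not published a dataset yet, the following is a rough sketch of what that can look like. The project name, dataset name, prompts, and the `prompt` column below are illustrative assumptions; the columns your dataset needs depend on the metrics you plan to run.

```python
import weave

# Sketch: publish a small prompt dataset to Weave.
# Names and rows here are placeholders.
weave.init(project_name="image-quality-leaderboard")

dataset = weave.Dataset(
    name="t2i-eval-prompts",
    rows=[
        {"prompt": "a red cube on top of a blue sphere"},
        {"prompt": "an astronaut riding a horse in a photorealistic style"},
    ],
)
weave.publish(dataset)
```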
Once you have a dataset in your Weave project, you can evaluate a text-to-image generation model against Hemm's metrics:
```python
import asyncio

import weave

from hemm.metrics.vqa import MultiModalLLMEvaluationMetric
from hemm.metrics.vqa.judges.mmllm_judges import OpenAIJudge
from hemm.models import DiffusersModel

# Initialize Weave
weave.init(project_name="image-quality-leaderboard")

# `DiffusersModel` is a `weave.Model` that uses a
# `diffusers.DiffusionPipeline` under the hood.
# You can write your own `weave.Model` if your
# model is not diffusers-compatible.
model = DiffusersModel(
    diffusion_model_name_or_path="stabilityai/stable-diffusion-2-1",
    image_height=1024,
    image_width=1024,
)

# Define the metric
metric = MultiModalLLMEvaluationMetric(judge=OpenAIJudge())

# Get the Weave dataset reference
dataset = weave.ref("Dataset:v2").get()

# Evaluate!
evaluation = weave.Evaluation(dataset=dataset, scorers=[metric])
summary = asyncio.run(evaluation.evaluate(model))
```
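As noted in the comments above, if your model is not diffusers-compatible you can wrap it in your own `weave.Model`. The sketch below only shows the general shape: the class name, the `checkpoint_path` field, and the stub body are hypothetical; your `predict` should call into your own pipeline, take arguments matching your dataset's columns, and return the generated image in whatever structure your chosen metrics expect.

```python
import weave
from PIL import Image


class MyTextToImageModel(weave.Model):
    # Hypothetical configuration field for your own inference backend.
    checkpoint_path: str

    @weave.op()
    def predict(self, prompt: str) -> Image.Image:
        # Replace this stub with a call into your own text-to-image
        # pipeline; here we return a blank image so the sketch runs.
        return Image.new("RGB", (1024, 1024))
```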