
A comprehensive benchmark for evaluating text-to-film generation performance.


showlab/T2F-Bench


Text-to-Film-Bench (T2F-Bench) 🎬

Text-to-Film-Bench (T2F-Bench) is a comprehensive benchmark designed for narrative-driven, multi-shot AI video generation.

While current benchmarks focus on short, descriptive clips, T2F-Bench shifts the focus to storytelling. We provide a dataset of 100 high-quality prompts derived from iconic cinematic scenes, transformed into professional screenplays by Gemini. This benchmark evaluates the capability of AI models to act as "directors," translating complex scripts into coherent, cinematic, and emotionally resonant long-form videos.


βš–οΈ Evaluation Framework

The evaluation is conducted from the perspective of a Producer auditing an AI-generated film (multi-shot long video).

Weight Distribution Overview

| Dimension | Weight | Focus |
| --- | --- | --- |
| 1. Narrative & Story Alignment | 30% | Plot fidelity, logic, elements, and sync. |
| 2. Continuity & Consistency | 25% | Character, spatial, and visual stability. |
| 3. Cinematic Language | 20% | Shot design, camera movement, and editing. |
| 4. Visual Aesthetics & Technical Quality | 10% | Aesthetics vs. technical artifacts. |
| 5. Performance & Acting | 15% | Emotional depth and physical realism. |

1️⃣ Narrative & Story Alignment [30%]

Core Question: Did the "AI Director" successfully capture the "drama" required by the script?

  • 1.1 Plot Integrity & Completeness [20%]: Does the video follow the narrative arc (Beginning → Middle → End)? Are Key Actions present?

  • 1.2 Plot Logicality [20%]: Is there a clear cause-and-effect relationship between actions?

  • 1.3 Element Accuracy [20%]:

      • Characters: Correct count and attributes (gender, age, ethnicity).

      • Costume/Props: Do they match the script? Are key props missing?

      • Environment/Time: Accuracy of setting and time (Day/Night, Int/Ext).

  • 1.4 Style & Tone [20%]: Does the vibe and aesthetic style match the narrative?

  • 1.5 Dialogue & Lip-Sync [20%]:

      • No Ventriloquism: The speaker's mouth must move whenever there is dialogue audio.

      • No "Phantom Speech": The mouth must remain shut when the character is silent.

      • Sync: Mouth movement must match the duration and cadence of the scripted dialogue.

2️⃣ Continuity & Consistency [25%]

Core Question: How good is the model's "memory"? Are there "breaks" in reality?

  • 2.1 Character Consistency [40%]: Face, hair, and clothing must remain identical across shot changes.

  • 2.2 Spatio-Temporal Logic [35%]:

      • Geographic Anchoring: Fixed room layouts (e.g., windows don't swap sides).

      • Temporal Flow: Continuous lighting and states.

  • 2.3 Visual Tone Consistency [25%]: Consistency in color temp and grain to avoid a "patchwork" feel.

3️⃣ Cinematic Language [20%]

Core Question: Is there intentionality in the shot composition and camera work?

  • 3.1 Shot Design [50%]: Effective use of Wide, Medium, and Close-up shots; natural transitions.

  • 3.2 Camera Movement & Interaction [30%]:

      • Motivation: Is the movement (Pan, Tilt, Dolly) purposeful?

      • Quality: Smooth trajectories; avoiding "random jitter."

  • 3.3 Editing [20%]: Fluid cutting points that hit the "beats" of the action.

4️⃣ Visual Aesthetics & Technical Quality [10%]

Core Question: Is the frame beautiful (ceiling) and is the physics sound (floor)?

  • 4.1 Cinematic Aesthetics [60%]: Adherence to composition rules, lighting depth, and high-fidelity textures.
  • 4.2 Technical Integrity [40%]: Absence of Artifacts, specifically:
      • Clipping: Limbs merging into objects.
      • Physical Fallacies: Anatomy errors, anti-gravity, or "sliding" steps.
      • Rendering: Flickering, noise, or frame tearing.

5️⃣ Performance & Acting [15%]

Core Question: Is the character a "living soul" or a "wooden puppet"?

  • 5.1 Facial Expression & Subtext [40%]: Expressions that match the subtext (e.g., Anger, Sadness), plus natural micro-expressions (blinking, eye movement).
  • 5.2 Voice Performance [30%]: Emotional tone and emphasis in speech.
  • 5.3 Physical Acting [30%]: Natural movements following biomechanics and conveying "weight."
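
The weights above suggest a two-level weighted aggregation: sub-category scores roll up into a dimension score, and dimension scores roll up into an overall score. The sketch below shows one way to do that in Python. The weights are copied from the rubric; the weighted-mean aggregation, the short dimension keys, and the function names are assumptions for illustration, since the README does not prescribe a formula or a score scale.

```python
# Weights transcribed from the rubric. The aggregation rule (weighted mean)
# and the dictionary keys are illustrative assumptions, not part of T2F-Bench.
RUBRIC = {
    "narrative":   (0.30, {"1.1": 0.20, "1.2": 0.20, "1.3": 0.20, "1.4": 0.20, "1.5": 0.20}),
    "continuity":  (0.25, {"2.1": 0.40, "2.2": 0.35, "2.3": 0.25}),
    "cinematic":   (0.20, {"3.1": 0.50, "3.2": 0.30, "3.3": 0.20}),
    "visual":      (0.10, {"4.1": 0.60, "4.2": 0.40}),
    "performance": (0.15, {"5.1": 0.40, "5.2": 0.30, "5.3": 0.30}),
}


def dimension_scores(sub_scores):
    """Weighted mean of sub-category scores within each dimension."""
    return {
        name: sum(weight * sub_scores[key] for key, weight in subs.items())
        for name, (_, subs) in RUBRIC.items()
    }


def overall_score(sub_scores):
    """Weighted mean of the five dimension scores."""
    dims = dimension_scores(sub_scores)
    return sum(RUBRIC[name][0] * score for name, score in dims.items())
```

Because each dimension is meant to be judged independently, it is probably more informative to report the output of `dimension_scores` alongside any single overall number rather than the overall number alone.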

🚨 Evaluation Principles

  1. Independence: Each dimension is evaluated separately. High visual quality does not compensate for poor narrative alignment.
  2. Mandatory Failure (Auto-Fail): Severe breakdowns (e.g., "body horror" distortions, background teleportation, or ventriloquism) result in an automatic minimum score (1) for that sub-category.
  3. Distinction First: In Elo comparisons, ties (A=B) should be avoided unless the two clips are truly indistinguishable.
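
Principles 2 and 3 can be sketched in code. The example below clamps auto-failed sub-categories to the minimum score of 1 and applies the standard Elo update to a pairwise comparison. The K-factor of 32, the 400-point scale, and all function names are assumptions chosen for illustration; the README does not state which Elo parameters the benchmark uses.

```python
MIN_SCORE = 1  # minimum score named in the Auto-Fail principle


def apply_auto_fail(sub_scores, failed):
    """Return a copy of sub_scores with each auto-failed sub-category set to MIN_SCORE."""
    return {key: (MIN_SCORE if key in failed else value)
            for key, value in sub_scores.items()}


def expected_score(rating_a, rating_b):
    """Standard Elo expected score of A against B (400-point logistic scale)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a, rating_b, outcome, k=32.0):
    """Update both ratings after one comparison.

    outcome: 1.0 if clip A is preferred, 0.0 if clip B is preferred,
    0.5 for a tie (which the rubric asks raters to avoid where possible).
    k=32 is an assumed K-factor, not a value taken from the benchmark.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (outcome - exp_a)
    new_b = rating_b + k * ((1.0 - outcome) - (1.0 - exp_a))
    return new_a, new_b
```

A tie between two equally rated models leaves both ratings unchanged, which is one reason to avoid ties: they contribute little information for separating models.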

📂 Repository Structure

.
├── AI-Film-Eval-Prompt.csv   # 100 multi-shot screenplay prompts
└── README.md                 # Project overview


🚀 Getting Started

  1. Generate: Use the scripts in AI-Film-Eval-Prompt.csv as inputs for your model.
  2. Evaluate: Follow the 5-dimension framework to conduct a human-in-the-loop audit.
  3. Benchmark: Compare your results against our provided baselines.
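
For step 1, a minimal way to read the prompt file in Python is shown below. The column layout of AI-Film-Eval-Prompt.csv is not documented here, so the sketch makes no assumption about column names and simply returns each row as a dictionary keyed by whatever headers the file contains. The function name `load_prompts` is hypothetical.

```python
import csv
from pathlib import Path


def load_prompts(path):
    """Read the prompt CSV and return one dict per row, keyed by the header row.

    newline="" lets the csv module handle screenplay fields that contain
    line breaks; "utf-8-sig" tolerates an optional byte-order mark.
    """
    with Path(path).open(newline="", encoding="utf-8-sig") as f:
        return list(csv.DictReader(f))
```

Calling `load_prompts("AI-Film-Eval-Prompt.csv")` and inspecting the keys of the first row is a quick way to find out which column holds the screenplay text before wiring it into a generation pipeline.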

License: Distributed under the MIT License. See LICENSE for more information.
