Text-to-Film-Bench (T2F-Bench) 🎬

Text-to-Film-Bench (T2F-Bench) is a comprehensive benchmark designed for narrative-driven, multi-shot AI video generation.

While current benchmarks focus on short, descriptive clips, T2F-Bench shifts the focus to storytelling. We provide a dataset of 100 high-quality prompts derived from iconic cinematic scenes, transformed into professional screenplays by Gemini. This benchmark evaluates the capability of AI models to act as "directors," translating complex scripts into coherent, cinematic, and emotionally resonant long-form videos.

⚖️ Evaluation Framework

The evaluation is conducted from the perspective of a Producer auditing an AI-generated film (multi-shot long video).

Weight Distribution Overview

Dimension	Weight	Focus
1. Narrative & Story Alignment	30%	Plot fidelity, logic, elements, and sync.
2. Continuity & Consistency	25%	Character, spatial, and visual stability.
3. Cinematic Language	20%	Shot design, camera movement, and editing.
4. Visual Aesthetics & Technical Quality	10%	Aesthetics vs. technical artifacts.
5. Performance & Acting	15%	Emotional depth and physical realism.
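
As an illustration of how the five weights could combine into a single number, here is a minimal sketch. The weights come from the table above; the dictionary keys, the `overall_score` helper, and the 1-to-5 example scores are assumptions made for this example and are not part of the benchmark.

```python
# Top-level dimension weights from the overview table, in percent.
# The key names are hypothetical labels chosen for this sketch.
DIMENSION_WEIGHTS = {
    "narrative": 30,
    "continuity": 25,
    "cinematic_language": 20,
    "visual_technical": 10,
    "performance": 15,
}


def overall_score(dimension_scores):
    """Return the weighted mean of the per-dimension scores.

    All scores are assumed to share one scale (the 1-5 range used
    below is an assumption, not something the benchmark specifies).
    """
    missing = set(DIMENSION_WEIGHTS) - set(dimension_scores)
    if missing:
        raise ValueError(f"missing dimension scores: {sorted(missing)}")
    total_weight = sum(DIMENSION_WEIGHTS.values())  # 100
    weighted = sum(
        weight * dimension_scores[name]
        for name, weight in DIMENSION_WEIGHTS.items()
    )
    return weighted / total_weight


example = {
    "narrative": 4,
    "continuity": 3,
    "cinematic_language": 5,
    "visual_technical": 2,
    "performance": 4,
}
print(overall_score(example))  # 3.75
```

Because each dimension is evaluated independently, a strong score in one dimension only contributes its own share: here the top score in Cinematic Language adds 20% of 5 and nothing more.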

1️⃣ Narrative & Story Alignment [30%]

Core Question: Did the "AI Director" successfully capture the "drama" required by the script?

1.1 Plot Integrity & Completeness [20%]: Does the video follow the narrative arc (Beginning → Middle → End)? Are Key Actions present?
1.2 Plot Logicality [20%]: Is there a clear cause-and-effect relationship between actions?
1.3 Element Accuracy [20%]:
Characters: Correct count and attributes (gender, age, ethnicity).
Costume/Props: Do they match the script? Are key props missing?
Environment/Time: Accuracy of setting and time (Day/Night, Int/Ext).
1.4 Style & Tone [20%]: Does the vibe and aesthetic style match the narrative?
1.5 Dialogue & Lip-Sync [20%]:
No Ventriloquism: Mouth must move when there is audio/dialogue.
No "Phantom Speech": Mouth must remain shut when silent.
Sync: Movement must match the duration and cadence of the script.
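
The two lip-sync failure modes in 1.5 can be stated precisely with a small sketch. It assumes hypothetical per-frame annotations (whether dialogue audio is present and whether the mouth is visibly moving); the benchmark itself relies on human judgment and does not define such annotations or this helper.

```python
def lip_sync_violations(speech_active, mouth_moving):
    """Count frames that show each lip-sync failure mode.

    speech_active[i]: True if the character has dialogue audio at frame i.
    mouth_moving[i]:  True if the character's mouth visibly moves at frame i.
    Both lists are hypothetical annotations assumed for this sketch.
    """
    if len(speech_active) != len(mouth_moving):
        raise ValueError("both annotation lists must cover the same frames")

    pairs = list(zip(speech_active, mouth_moving))
    # Ventriloquism: dialogue is heard but the mouth stays still.
    ventriloquism = sum(1 for speech, mouth in pairs if speech and not mouth)
    # Phantom speech: the mouth moves although nothing is said.
    phantom = sum(1 for speech, mouth in pairs if mouth and not speech)

    total = len(pairs)
    mismatch_ratio = (ventriloquism + phantom) / total if total else 0.0
    return {
        "ventriloquism_frames": ventriloquism,
        "phantom_speech_frames": phantom,
        "mismatch_ratio": mismatch_ratio,
    }


speech = [False, True, True, True, False, False]
mouth = [False, True, False, True, True, False]
print(lip_sync_violations(speech, mouth))
```

In this toy input, frame 2 counts as ventriloquism and frame 4 as phantom speech. The sketch does not cover the Sync criterion, which depends on duration and cadence and would need timing information.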

2️⃣ Continuity & Consistency [25%]

Core Question: How reliable is the model's "memory"? Are there "breaks" in reality?

2.1 Character Consistency [40%]: Face, hair, and clothing must remain identical across shot changes.
2.2 Spatio-Temporal Logic [35%]:
Geographic Anchoring: Fixed room layouts (e.g., windows don't swap sides).
Temporal Flow: Continuous lighting and states.
2.3 Visual Tone Consistency [25%]: Consistent color temperature and grain across shots, avoiding a "patchwork" feel.

3️⃣ Cinematic Language [20%]

Core Question: Is there intentionality in the shot composition and camera work?

3.1 Shot Design [50%]: Effective use of Wide, Medium, and Close-up shots; natural transitions.
3.2 Camera Movement & Interaction [30%]:
Motivation: Is the movement (Pan, Tilt, Dolly) purposeful?
Quality: Smooth trajectories; avoiding "random jitter."
3.3 Editing [20%]: Fluid cutting points that hit the "beats" of the action.

4️⃣ Visual Aesthetics & Technical Quality [10%]

Core Question: Is the frame beautiful (ceiling) and is the physics sound (floor)?

4.1 Cinematic Aesthetics [60%]: Adherence to composition rules, lighting depth, and high-fidelity textures.
4.2 Technical Integrity [40%]: Absence of Artifacts, specifically:
Clipping: Limbs merging into objects.
Physical Fallacies: Anatomy errors, anti-gravity, or "sliding" steps.
Rendering: Flickering, noise, or frame tearing.

5️⃣ Performance & Acting [15%]

Core Question: Is the character a "living soul" or a "wooden puppet"?

5.1 Facial Expression & Subtext [40%]: Expressions that match the subtext (e.g., anger, sadness), with natural micro-expressions (blinking, eye movement).
5.2 Voice Performance [30%]: Emotional tone and emphasis in speech.
5.3 Physical Acting [30%]: Natural movements following biomechanics and conveying "weight."
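
The sub-category weights listed under each dimension sum to 100% within that dimension. The sketch below records them and rolls sub-scores up into a dimension score. The key names, the `dimension_score` helper, and the example scores are assumptions for illustration; the weights are the ones stated above.

```python
# Sub-category weights per dimension, in percent, as listed in sections 1-5.
# Dimension key names are hypothetical labels chosen for this sketch.
SUB_WEIGHTS = {
    "narrative": {"1.1": 20, "1.2": 20, "1.3": 20, "1.4": 20, "1.5": 20},
    "continuity": {"2.1": 40, "2.2": 35, "2.3": 25},
    "cinematic_language": {"3.1": 50, "3.2": 30, "3.3": 20},
    "visual_technical": {"4.1": 60, "4.2": 40},
    "performance": {"5.1": 40, "5.2": 30, "5.3": 30},
}

# Sanity check: the weights inside every dimension add up to 100.
for name, weights in SUB_WEIGHTS.items():
    assert sum(weights.values()) == 100, name


def dimension_score(dimension, sub_scores):
    """Weighted mean of the sub-category scores of one dimension."""
    weights = SUB_WEIGHTS[dimension]
    if set(weights) != set(sub_scores):
        raise ValueError(f"expected scores for {sorted(weights)}")
    return sum(weights[key] * sub_scores[key] for key in weights) / 100


# Example: strong character consistency, weaker spatio-temporal logic.
print(dimension_score("continuity", {"2.1": 5, "2.2": 3, "2.3": 4}))  # 4.05
```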

🚨 Evaluation Principles

Independence: Each dimension is evaluated separately. High visual quality does not compensate for poor narrative alignment.
Mandatory Failure (Auto-Fail): Severe breakdowns (e.g., "body horror" distortions, background teleportation, or ventriloquism) result in an automatic minimum score (1) for that sub-category.
Distinction First: In Elo comparisons, ties (A=B) should be avoided unless the two clips are truly indistinguishable.
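
Two of these principles are mechanical enough to sketch in code. The auto-fail rule forces a sub-category to the minimum score of 1, and pairwise judgments can feed a standard Elo update. The starting rating of 1500, the K-factor of 32, and the helper names below are conventional Elo defaults assumed for this example; the benchmark does not specify them.

```python
MIN_SCORE = 1  # minimum score named in the Mandatory Failure principle


def apply_auto_fail(sub_scores, failed):
    """Return a copy of sub_scores with auto-failed sub-categories set to 1.

    `failed` is the set of sub-category keys where an evaluator flagged a
    severe breakdown. Other sub-categories are left untouched, because the
    rule applies per sub-category.
    """
    return {
        key: (MIN_SCORE if key in failed else score)
        for key, score in sub_scores.items()
    }


def elo_update(rating_a, rating_b, outcome_a, k=32.0):
    """Standard Elo update for one A-versus-B comparison.

    outcome_a is 1.0 if A wins, 0.0 if B wins, and 0.5 for a tie
    (ties are allowed here but should be rare under Distinction First).
    """
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    delta = k * (outcome_a - expected_a)
    return rating_a + delta, rating_b - delta


print(apply_auto_fail({"2.1": 4, "2.2": 5, "2.3": 4}, failed={"2.2"}))
print(elo_update(1500.0, 1500.0, 1.0))  # (1516.0, 1484.0)
```

A tie between equally rated models leaves both ratings unchanged, which is one reason to avoid ties: they add no information to the ranking.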

📂 Repository Structure

.
├── AI-Film-Eval-Prompt.csv   # 100 multi-shot screenplay prompts
└── README.md                 # Project overview

🚀 Getting Started

1. Generate: Use the scripts in AI-Film-Eval-Prompt.csv as inputs for your model.
2. Evaluate: Follow the 5-dimension framework to conduct a human-in-the-loop audit.
3. Benchmark: Compare your results against our provided baselines.
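
For the Generate step, the prompts can be read with Python's standard `csv` module. This is a minimal sketch: the column layout of AI-Film-Eval-Prompt.csv is not documented here, so the code assumes only that the file has a header row, and it prints the column names rather than guessing them. The `load_prompts` helper is a hypothetical name.

```python
import csv
import os


def load_prompts(path):
    """Read a CSV file with a header row into a list of dicts.

    No column names are assumed: each row is keyed by whatever
    headers the file actually contains.
    """
    # utf-8-sig also accepts plain UTF-8 and strips a leading BOM if present.
    with open(path, newline="", encoding="utf-8-sig") as handle:
        return list(csv.DictReader(handle))


if __name__ == "__main__":
    csv_path = "AI-Film-Eval-Prompt.csv"
    if os.path.exists(csv_path):
        prompts = load_prompts(csv_path)
        print(f"loaded {len(prompts)} prompts")
        if prompts:
            # Inspect the real column names before wiring up a model.
            print("columns:", list(prompts[0].keys()))
    else:
        print(f"{csv_path} not found; run this from the repository root")
```

Each row can then be passed to your video model, with the generated film saved under an identifier taken from the row so that evaluators can match outputs to screenplays.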

License: Distributed under the MIT License. See LICENSE for more information.
