Text-to-Film-Bench (T2F-Bench) is a comprehensive benchmark designed for narrative-driven, multi-shot AI video generation.
While current benchmarks focus on short, descriptive clips, T2F-Bench shifts the focus to storytelling. We provide a dataset of 100 high-quality prompts derived from iconic cinematic scenes, transformed into professional screenplays by Gemini. This benchmark evaluates the capability of AI models to act as "directors," translating complex scripts into coherent, cinematic, and emotionally resonant long-form videos.
The evaluation is conducted from the perspective of a Producer auditing an AI-generated film (multi-shot long video).
| Dimension | Weight | Focus |
|---|---|---|
| 1. Narrative & Story Alignment | 30% | Plot fidelity, logic, elements, and sync. |
| 2. Continuity & Consistency | 25% | Character, spatial, and visual stability. |
| 3. Cinematic Language | 20% | Shot design, camera movement, and editing. |
| 4. Visual & Technical Quality | 10% | Aesthetics vs. technical artifacts. |
| 5. Performance & Acting | 15% | Emotional depth and physical realism. |
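As a minimal sketch of how the weights in the table could combine per-dimension scores into a single number, the snippet below computes a weighted mean. It assumes every dimension has already been scored on one common numeric scale (the 1 to 5 range in the example is an assumption). The dictionary keys and the `overall_score` helper are illustrative names, not part of the benchmark.

```python
# Weights copied from the dimension table; the key names are illustrative.
DIMENSION_WEIGHTS = {
    "narrative": 0.30,
    "continuity": 0.25,
    "cinematic_language": 0.20,
    "visual_technical": 0.10,
    "performance": 0.15,
}


def overall_score(dimension_scores: dict[str, float]) -> float:
    """Weighted mean of per-dimension scores (hypothetical helper)."""
    missing = set(DIMENSION_WEIGHTS) - set(dimension_scores)
    if missing:
        raise ValueError(f"missing dimension scores: {sorted(missing)}")
    return sum(
        weight * dimension_scores[name]
        for name, weight in DIMENSION_WEIGHTS.items()
    )


if __name__ == "__main__":
    # Example scores on an assumed 1-5 scale.
    example = {
        "narrative": 4,
        "continuity": 3,
        "cinematic_language": 4,
        "visual_technical": 5,
        "performance": 2,
    }
    print(round(overall_score(example), 2))
```

Reporting the five dimension scores alongside the total keeps a strong visual score from hiding a weak narrative score.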
Core Question: Did the "AI Director" successfully capture the "drama" required by the script?
- 1.1 Plot Integrity & Completeness [20%]: Does the video follow the narrative arc (Beginning → Middle → End)? Are Key Actions present?
- 1.2 Plot Logicality [20%]: Is there a clear cause-and-effect relationship between actions?
- 1.3 Element Accuracy [20%]:
  - Characters: Correct count and attributes (gender, age, ethnicity).
  - Costume/Props: Do they match the script? Are key props missing?
  - Environment/Time: Accuracy of setting and time (Day/Night, Int/Ext).
- 1.4 Style & Tone [20%]: Does the vibe and aesthetic style match the narrative?
- 1.5 Dialogue & Lip-Sync [20%]:
  - No Ventriloquism: Mouth must move when there is audio/dialogue.
  - No "Phantom Speech": Mouth must remain shut when silent.
  - Sync: Movement must match the duration and cadence of the script.
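The two lip-sync rules above can be expressed as a simple per-frame check. The sketch below assumes an annotator (or some upstream tool) has already produced two boolean flags per frame: whether dialogue audio is present and whether the speaker's mouth is moving. The function name and the flag format are hypothetical and are not part of the benchmark.

```python
def lip_sync_violations(speech_active, mouth_moving):
    """Count frames that break the two lip-sync rules.

    Both arguments are equal-length sequences of booleans, one per frame
    (an assumed annotation format).
    """
    if len(speech_active) != len(mouth_moving):
        raise ValueError("flag sequences must have the same length")

    pairs = list(zip(speech_active, mouth_moving))
    # Ventriloquism: dialogue is audible but the mouth does not move.
    ventriloquism = sum(1 for speech, mouth in pairs if speech and not mouth)
    # Phantom speech: the mouth moves although nothing is being said.
    phantom = sum(1 for speech, mouth in pairs if mouth and not speech)

    total = len(pairs)
    return {
        "ventriloquism_frames": ventriloquism,
        "phantom_speech_frames": phantom,
        "violation_ratio": (ventriloquism + phantom) / total if total else 0.0,
    }


if __name__ == "__main__":
    speech = [True, True, False, False]
    mouth = [True, False, True, False]
    print(lip_sync_violations(speech, mouth))
```

A count like this only supports the human judgment for 1.5. It does not cover the Sync rule, which depends on the cadence of the scripted lines.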
Core Question: How is the "memory" of the model? Are there "breaks" in reality?
- 2.1 Character Consistency [40%]: Face, hair, and clothing must remain identical across shot changes.
- 2.2 Spatio-Temporal Logic [35%]:
  - Geographic Anchoring: Fixed room layouts (e.g., windows don't swap sides).
  - Temporal Flow: Continuous lighting and states.
- 2.3 Visual Tone Consistency [25%]: Consistency in color temp and grain to avoid a "patchwork" feel.
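Character consistency is judged by a human reviewer in this framework, but an automated pre-screen can point the reviewer at suspicious shot changes. The sketch below assumes you already have one numeric embedding per shot for the same character (for example from a face or appearance encoder of your choice). The embedding source, the `inconsistent_shots` helper, and the 0.8 threshold are all assumptions for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def inconsistent_shots(shot_embeddings, threshold=0.8):
    """Return indices of shots whose embedding drifts from the first shot.

    The first shot is used as the reference appearance (an assumption).
    """
    reference = shot_embeddings[0]
    return [
        index
        for index, embedding in enumerate(shot_embeddings[1:], start=1)
        if cosine_similarity(reference, embedding) < threshold
    ]


if __name__ == "__main__":
    # Toy 2-D embeddings: the third shot drifts far from the first.
    shots = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
    print(inconsistent_shots(shots))
```

Flagged shots are candidates for closer inspection under 2.1, not verdicts.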
Core Question: Is there intentionality in the shot composition and camera work?
- 3.1 Shot Design [50%]: Effective use of Wide, Medium, and Close-up shots; natural transitions.
- 3.2 Camera Movement & Interaction [30%]:
  - Motivation: Is the movement (Pan, Tilt, Dolly) purposeful?
  - Quality: Smooth trajectories; avoiding "random jitter."
- 3.3 Editing [20%]: Fluid cutting points that hit the "beats" of the action.
Core Question: Is the frame beautiful (ceiling) and is the physics sound (floor)?
- 4.1 Cinematic Aesthetics [60%]: Adherence to composition rules, lighting depth, and high-fidelity textures.
- 4.2 Technical Integrity [40%]: Absence of Artifacts, specifically:
  - Clipping: Limbs merging into objects.
  - Physical Fallacies: Anatomy errors, anti-gravity, or "sliding" steps.
  - Rendering: Flickering, noise, or frame tearing.
Core Question: Is the character a "living soul" or a "wooden puppet"?
- 5.1 Facial Expression & Subtext [40%]: Expressions matching subtext (Anger, Sadness) and micro-expressions (blinking, eye-movement).
- 5.2 Voice Performance [30%]: Emotional tone and emphasis in speech.
- 5.3 Physical Acting [30%]: Natural movements following biomechanics and conveying "weight."
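The bracketed percentages attached to each sub-criterion can be read as weights inside their dimension. The sketch below rolls sub-criterion scores up into a dimension score on that reading. The `SUB_WEIGHTS` layout, the `dimension_score` helper, and the 1 to 5 scale used in the example are assumptions.

```python
# Sub-criterion weights as listed under each dimension, keyed by dimension number.
SUB_WEIGHTS = {
    1: {"1.1": 0.20, "1.2": 0.20, "1.3": 0.20, "1.4": 0.20, "1.5": 0.20},
    2: {"2.1": 0.40, "2.2": 0.35, "2.3": 0.25},
    3: {"3.1": 0.50, "3.2": 0.30, "3.3": 0.20},
    4: {"4.1": 0.60, "4.2": 0.40},
    5: {"5.1": 0.40, "5.2": 0.30, "5.3": 0.30},
}


def dimension_score(dimension: int, sub_scores: dict[str, float]) -> float:
    """Weighted mean of the sub-criterion scores for one dimension."""
    weights = SUB_WEIGHTS[dimension]
    if set(sub_scores) != set(weights):
        raise ValueError(f"expected scores for exactly {sorted(weights)}")
    return sum(weights[key] * sub_scores[key] for key in weights)


if __name__ == "__main__":
    # Example: Performance & Acting scored on an assumed 1-5 scale.
    print(round(dimension_score(5, {"5.1": 4, "5.2": 3, "5.3": 2}), 2))
```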
- Independence: Each dimension is evaluated separately. High visual quality does not compensate for poor narrative alignment.
- Mandatory Failure (Auto-Fail): Severe breakdowns (e.g., "body horror" distortions, background teleportation, or ventriloquism) result in an automatic minimum score (1) for that sub-category.
- Distinction First: In Elo comparisons, ties (A=B) should be avoided unless the two clips are truly indistinguishable.
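Two of the rules above translate directly into code. The sketch below shows an auto-fail override and a standard Elo rating update. The benchmark does not specify Elo parameters, so the K-factor of 32, the 400-point scale, and the 1500 starting rating are conventional defaults assumed here. The function names are hypothetical.

```python
MIN_SCORE = 1  # minimum score named in the auto-fail rule


def apply_auto_fail(score, severe_breakdown):
    """Force a sub-category to the minimum score when a severe breakdown is seen."""
    return MIN_SCORE if severe_breakdown else score


def expected_score(rating_a, rating_b):
    """Standard Elo expected score of A against B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def elo_update(rating_a, rating_b, outcome_a, k=32.0):
    """Return updated (rating_a, rating_b).

    outcome_a is 1.0 if A is preferred, 0.0 if B is preferred, 0.5 for a tie.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (outcome_a - exp_a)
    new_b = rating_b + k * ((1.0 - outcome_a) - (1.0 - exp_a))
    return new_a, new_b


if __name__ == "__main__":
    print(apply_auto_fail(4, severe_breakdown=True))
    print(elo_update(1500, 1500, 1.0))  # A preferred over B
    print(elo_update(1500, 1500, 0.5))  # tie
```

When two models hold equal ratings, a tie leaves both ratings unchanged. Such a comparison adds no ranking signal, which is consistent with the guidance to avoid ties where a real difference exists.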
.
├── AI-Film-Eval-Prompt.csv # 100 multi-shot screenplay prompts
└── README.md # Project overview
- Generate: Use the scripts in `AI-Film-Eval-Prompt.csv` as inputs for your model.
- Evaluate: Follow the 5-dimension framework to conduct a human-in-the-loop audit.
- Benchmark: Compare your results against our provided baselines.
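For the Generate step, a minimal way to load the prompt file is sketched below. The column layout of `AI-Film-Eval-Prompt.csv` is not documented here, so the code assumes only that the file has a header row and is UTF-8 encoded. It returns each row as a dictionary without relying on specific column names. `load_prompts` is a hypothetical helper.

```python
import csv
from pathlib import Path


def load_prompts(csv_path="AI-Film-Eval-Prompt.csv"):
    """Read every row of the prompt file as a dict keyed by the CSV header.

    Assumes a header row; "utf-8-sig" also tolerates a leading byte-order mark.
    """
    with Path(csv_path).open(newline="", encoding="utf-8-sig") as handle:
        return list(csv.DictReader(handle))
```

Calling `load_prompts()` from the repository root and inspecting `rows[0].keys()` shows which columns are available before you wire the screenplays into your own generation pipeline.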
License: Distributed under the MIT License. See LICENSE for more information.