Text-driven generation models have recently flourished in video editing, producing compelling
results.
However, face-centric text-to-video generation remains highly challenging, as a suitable
dataset with high-quality videos and highly relevant texts is lacking.
In this work, we present a large-scale, high-quality, and diverse facial text-video dataset, CelebV-Text,
to facilitate the research of facial text-to-video generation tasks.
CelebV-Text contains 70,000 in-the-wild face video clips covering diverse visual content.
Each video clip is paired with 20 texts generated by the proposed semi-automatic text
generation strategy,
which precisely describes both static and dynamic attributes.
We conduct comprehensive statistical analyses of the videos, texts, and text-video relevance
of CelebV-Text,
verifying its superiority over other datasets.
We also perform extensive self-evaluations to demonstrate the effectiveness and potential of
CelebV-Text.
Furthermore, we construct a benchmark with representative methods to standardize the
evaluation of facial text-to-video generation.
For more details on the dataset, please refer to the paper "CelebV-Text: A Large-Scale Facial Text-Video
Dataset".