CelebV-Text: A Large-Scale Facial Text-Video Dataset

CVPR 2023

Jianhui Yu1*, Hao Zhu2*, Liming Jiang3, Chen Change Loy3, Weidong Cai1, Wayne Wu4†
(*Equal contribution)
1University of Sydney, 2SenseTime Research, 3S-Lab, Nanyang Technological University, 4Shanghai AI Laboratory

Why CelebV-Text?

Abstract

Text-driven generation models are flourishing in video editing, with compelling results. For face-centric text-to-video generation, however, serious challenges remain because no suitable dataset with high-quality videos and highly relevant texts exists. In this work, we present CelebV-Text, a large-scale, high-quality, and diverse facial text-video dataset, to facilitate research on facial text-to-video generation tasks.

CelebV-Text contains 70,000 in-the-wild face video clips covering diverse visual content. Each video clip is paired with 20 texts generated by the proposed semi-auto text generation strategy, which describes both static and dynamic attributes precisely. We conduct a comprehensive statistical analysis of the videos, texts, and text-video relevance of CelebV-Text, verifying its superiority over other datasets. We also conduct extensive self-evaluations to show the effectiveness and potential of CelebV-Text. Furthermore, we construct a benchmark with representative methods to standardize the evaluation of the facial text-to-video generation task.

For more details of the dataset, please refer to the paper "CelebV-Text: A Large-Scale Facial Text-Video Dataset".

Overview Video


Face Datasets Comparison

CelebV-Text contains 70,000 video clips with a total duration of around 279 hours. Each video is accompanied by 20 sentences describing 6 designed attributes, including 40 general appearances, 5 detailed appearances, 6 light conditions, 37 actions, 8 emotions, and 6 light directions.
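
As a quick sanity check on the figures above, the following minimal sketch derives a few quantities that are implied but not stated: the average clip length, the total number of text descriptions, and the total number of label classes. The variable names are illustrative, and because the total duration is given only as "around 279 hours", the average clip length is approximate.

```python
# Figures quoted in the paragraph above.
NUM_CLIPS = 70_000
TOTAL_HOURS = 279        # stated as "around 279 hours", so this is approximate
TEXTS_PER_CLIP = 20

# Number of label classes per attribute, as listed above.
CLASSES_PER_ATTRIBUTE = {
    "general appearance": 40,
    "detailed appearance": 5,
    "light condition": 6,
    "action": 37,
    "emotion": 8,
    "light direction": 6,
}

avg_clip_seconds = TOTAL_HOURS * 3600 / NUM_CLIPS
total_texts = NUM_CLIPS * TEXTS_PER_CLIP
total_classes = sum(CLASSES_PER_ATTRIBUTE.values())

print(f"average clip length: {avg_clip_seconds:.1f} s")  # about 14.3 s
print(f"text descriptions:   {total_texts:,}")           # 1,400,000
print(f"label classes:       {total_classes}")           # 102
```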


Statistics

(a) Video Analysis

The distribution of each attribute. CelebV-Text has a diverse distribution across the classes of every attribute.


(b) Text Analysis

The distributions of generated texts. CelebV-Text has diverse, natural, and scalable texts.


(c) Text-Video Relevance

The relevance between the generated texts and the videos. The texts in CelebV-Text are relevant to their videos.




Demo of Videos (Full description of the video)

This young man has double chin, bags under eyes and beard. He has moles on each corner of the mouth. He starts with an expression of happiness, and he then turns to be neutral. To begin with, He talks for a short time, next he turns his head. This video is bright with a daylight color temperature. The light direction is back lighting for the whole time.

This female has arched eyebrows and blond hair. She has high cheekbones. She wears earrings and lipstick. There is a mole in the right cheek and two dimples on both sides of the mouth of the mouth. At the beginning, this woman remains an expression of surprise, and next she turns into an expression of fear, and she then has a face of surprise, and then she has a face of surprise, in the end, she has an expression of sadness. To begin with, this female talks for a short time, then she turns for a short time. The light intensity of the video is bright with a cool white color temperature. The light direction is front lighting for the whole time.

A woman is wearing lipstick. She is young. She has heavy makeup and straight hair. She has a mole on the corner of the left and right eyes, and there are two moles in the middle of the left cheek. The woman has got an expression of surprise throughout the video. This woman wags head while talking for a long time. The light intensity of the video intensity is dark and the color temperature is daylight. The light direction is front lighting for the whole time.

He is chubby, with beard and double chin. He has high cheekbones and goatee, with a mole at the corner of the left face. This man first has an expression of happiness, and next he has an expression of disgust, he eventually has a face of sadness. He talks for a long time. The light intensity of this video is normal with a cool white color temperature. The light direction is front lighting for the whole video.



(a) Demo of General Appearances

He is wearing eyeglasses and he is young. The man is wearing a hat.

She is young. This woman has black hair. She is wearing eyeglasses and wearing a hat.

The young man has beard and 5 o'clock shadow. He has bushy eyebrows and sideburns.

He has beard, and he is wearing goatee and eyeglasses.



(b) Demo of Detailed Appearances

He has two dimples on the left and right of the mouth.

He has round moles on the corner of the left eye and above the eyebrows.

She has a mole under the mouth.

There is a mole under the right eye and one mole on the left side of the nose.



(c) Demo of Actions

To begin with, the male talks for a short time, then he eats for a short time, he finally talks for a short time.

He begins to turn meanwhile talking for a short time, then he turns for a short time.

This man first smiles for a short time, and then eats for a short time, in the end, he chews for a short time.

This woman begins to turns and laughs at the same time for a short time, then she turns for a short time, she eventually turns for a short time.
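
The action descriptions above follow a regular temporal pattern: an opening clause, optional middle clauses introduced by "then", and a closing clause. As a minimal sketch of how timed action labels could be rendered into such a sentence, the following uses a hypothetical helper, `describe_actions`, with a connective scheme inferred from the first example above. It is an illustration only and not the authors' generation pipeline, which this page does not specify.

```python
from typing import List, Tuple


def describe_actions(subject: str, pronoun: str,
                     segments: List[Tuple[str, str]]) -> str:
    """Render (verb, duration) segments as one temporal sentence.

    Hypothetical template: the connectives are inferred from the demo
    sentences on this page, not taken from the dataset's real templates.
    """
    if not segments:
        raise ValueError("at least one segment is required")
    last = len(segments) - 1
    clauses = []
    for i, (verb, duration) in enumerate(segments):
        if i == 0:
            # Opening clause names the subject explicitly.
            clauses.append(f"To begin with, {subject} {verb} for {duration}")
        elif i == last:
            # Closing clause marks the end of the sequence.
            clauses.append(f"{pronoun} finally {verb} for {duration}")
        else:
            # Middle clauses are chained with "then".
            clauses.append(f"then {pronoun} {verb} for {duration}")
    return ", ".join(clauses) + "."


sentence = describe_actions(
    "the male", "he",
    [("talks", "a short time"), ("eats", "a short time"), ("talks", "a short time")],
)
print(sentence)
```

With these three segments the sketch reproduces the first action description above. Real templates would presumably vary the connectives and subject phrases to keep the texts diverse.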



(d) Demo of Emotions

All the while this woman has a face of happiness.

At the beginning, this man has a surprised face, and then he turns into an angry expression.

This female first has a face of happiness, she then turns into an expression of surprise.

At the beginning, this man has a sad expression, and then he turns surprised, he finally is happy.



(e) Demo of Light Directions

To begin with, the light direction is side lighting with 90 degrees to the left of the face for a short time. The light direction is then side lighting with 45 degrees to the left of the face for a long time. The light direction is then front lighting for a short time. Then the light direction is side lighting with 45 degrees to the left of the face for a long time. In the end, the light direction is front lighting for some time.

The light direction is first front lighting for a long time. Then the light direction is side lighting with 45 degrees to the right of the face for a long time.

To begin with, the light direction is side lighting with 90 degrees to the right of the face for some time. The light direction is then side lighting with 45 degrees to the right of the face for a moderate time.

The light direction is first front lighting for a short time. Then the light direction is side lighting with 45 degrees to the right of the face for a short time. In the end, the light direction is front lighting for a moderate time.



(f) Demo of Light Intensity

The light intensity of the video is dark.

The light intensity of the video is dark.

This video is bright.

The light intensity of the video is normal.



(g) Demo of Light Color Temperature

The color temperature of this video is bright.

The color temperature is cool white.

The color temperature is daylight.

The color temperature is daylight.

Benchmark

We construct a benchmark for the facial text-to-video generation task, evaluating two currently prevalent models (TFGAN and MMVID) on three datasets (MM-Vox, CelebV-HQ* and CelebV-Text).
(*We generate text descriptions for CelebV-HQ based on our designed text templates.)

Table: FVD/FID/CLIPSIM Metrics Comparison with Input Texts about General Appearance
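
Of the three metrics, CLIPSIM is the one that measures text-video relevance: it is commonly computed as the average CLIP similarity between the input text and the generated frames. The sketch below shows that averaging step on precomputed embeddings. It assumes the text and frame embeddings have already been produced by a CLIP model; the function names are illustrative, and the exact protocol behind the reported numbers (CLIP variant, frame sampling, score scaling) is not stated on this page.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0:
        raise ValueError("zero-length vector")
    return dot / norm


def clipsim(text_emb: Sequence[float],
            frame_embs: Sequence[Sequence[float]]) -> float:
    """Average text-frame cosine similarity over all frames of one video."""
    if not frame_embs:
        raise ValueError("at least one frame embedding is required")
    return sum(cosine(text_emb, f) for f in frame_embs) / len(frame_embs)


# Toy 2-D embeddings: one frame aligned with the text, one orthogonal to it.
score = clipsim([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
print(score)  # 0.5
```

A higher score indicates that the generated frames match the text more closely. FVD and FID, by contrast, are distances between feature distributions of generated and real data, so lower values are better.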

Facial Text-to-Video Generation

We conduct experiments to show the effectiveness of our CelebV-Text dataset based on MMVID, where input texts contain descriptions of different attributes.

Table: FVD/FID/CLIPSIM Metrics Comparison with Input Texts about General Appearance with Emotion/Action.

Generated Results

The young man has arched eyebrows and brown hair. He has bags under eyes. He is first neutral and then happy, and finally he is neutral. He smiles all the time.

The young woman has arched eyebrows. She has brown hair and she has an oval face. She is first surprised and then neutral. She wags her head and then talks.

The young man has brown hair. He has beard and 5 o'clock shadow. He is neutral and then turns he is surprised. He is talking and then he nods his head.

This young woman has a long brown hair. She wears lipstick. She has an oval face. Firstly, she is happy, and then she has a neutral face. She blinks and then gazes.

The man has bushy eyebrows and bags under eyes. He has beard and mustache. He has brown hair. He is first neutral and then turns surprised. He begins with talking and then wagging his head.

She has long black hair. She is young. She has arched eyebrows and bags under eyes. She is happy all the time. She nods her head and then smiles.

The man has beard and 5 o'clock shadow. He has bushy eyebrows. He has an angry face the whole time. He begins to smile and then talks.

She is young. She has arched eyebrows and is wearing lipstick. She has long hair. This woman begins with a neutral face and then surprised. She talks and then turns the head.

The man is young. He has arched eyebrows. The man has a neutral face and then surprised. He wags his head and then talks.

This young woman has wavy and long hair. She wears lipstick and earrings. She keeps a happy face all the time. She first smiles and then blinks.

He has beard and mustache, and he is wearing eyeglasses. He is angry all the time. He talks and then wags the head.

This young female has straight hair. She has long black hair. The woman has arched eyebrows and bags under eyes. She is happy and then turns to be neutral. She smiles and then turns her head.

*The generated videos presented above were interpolated 10 times for better visualization.


Visual ChatGPT Demo

This is a toy example of applying a text-to-face model with ChatGPT. In this demo, we use MMVID trained simply on the proposed CelebV-Text dataset to demonstrate CelebV-Text's potential in enabling visual GPT applications. In the future, more sophisticated methods are expected to lead to better results.

Agreement

  • The CelebV-Text dataset is available for non-commercial research purposes only.
  • All videos of the CelebV-Text dataset are obtained from the Internet and are not the property of our institutions. Our institutions are not responsible for the content or the meaning of these videos.
  • You agree not to reproduce, duplicate, copy, sell, trade, resell or exploit for any commercial purposes, any portion of the videos and any portion of derived data.
  • You agree not to further copy, publish or distribute any portion of the CelebV-Text dataset. As an exception, copies of the dataset may be made for internal use at a single site within the same organization.

Resources

We provide a download tool that automatically fetches and processes videos from YouTube. We highly recommend using this tool to acquire the dataset. In addition, as some links may no longer be available, we host the full version of CelebV-Text. Please contact us if needed.


More Work That May Interest You

Several of our previous publications might be of interest to you.

Face Generation:

CelebV-HQ: A Large-scale Video Facial Attributes Dataset. Zhu et al., ECCV 2022

TransEditor: Transformer-Based Dual-Space GAN for Highly Controllable Facial Editing. Xu et al., CVPR 2022

Human Generation:

3DHumanGAN: Towards Photo-realistic 3D-Aware Human Image Generation. Yang et al., Tech. Report 2022

StyleGAN-Human: A Data-Centric Odyssey of Human. Fu et al., ECCV 2022

Text2Human: Text-Driven Controllable Human Image Generation. Jiang et al., SIGGRAPH 2022


Acknowledgements

CelebV-Text is affiliated with OpenXDLab -- an open platform for X-Dimension high-quality data. This work is supported by NTU NAP, MOE AcRF Tier 1 (2021-T1-001-088).


BibTeX

If you find this helpful, please cite our work:

@inproceedings{yu2022celebvtext,
  title={{CelebV-Text}: A Large-Scale Facial Text-Video Dataset},
  author={Yu, Jianhui and Zhu, Hao and Jiang, Liming and Loy, Chen Change and Cai, Weidong and Wu, Wayne},
  booktitle={CVPR},
  year={2023}
}