Generate Your Talking Avatar from Video Reference

¹Nanyang Technological University    ²University of Melbourne    ³HeyGen Research

Introduction Videos

Below are introduction videos generated by our method, where synthesized avatars present and explain our work. The driving audio is generated by IndexTTS2.

Generated Reference
A medium shot captures a man with short brown hair and a light grey t-shirt, positioned centrally, speaking directly into a black microphone. He stands in an indoor setting, likely a home office or studio, with dark walls and two wooden shelves behind him. The shelves display various items: a green potted plant, a framed map, a silver YouTube play button, a small white airplane model, a framed photo, and a stack of books. The man's facial expressions are animated and engaged, conveying enthusiasm as he articulates his words. He frequently uses expressive hand gestures, moving his hands up and down and outward to emphasize points. Bright, even lighting illuminates his face, creating a clear and professional look, while a subtle blue light emanates from the lower left, adding a touch of atmosphere to the dark background. The overall aesthetic is a realistic, cinematic style, focusing on clear communication.
Generated Reference
A medium shot captures a young woman with long, dark hair and a light-colored, textured top, speaking directly to the camera. She stands in a brightly lit, modern indoor setting with a vibrant yellow wall behind her. The background is softly blurred, revealing a grey armchair and decorative shelves with small objects on the right, and a light wooden panel on the left. The woman's expression is cheerful and engaging, with a friendly smile as her mouth moves in sync with her speech. She occasionally nods her head and uses subtle hand gestures to emphasize her words, conveying a sense of confidence and approachability. The lighting is soft and even, creating a warm and inviting atmosphere. The overall aesthetic is a realistic, clean cinematic style.

Abstract

Existing talking avatar methods typically adopt an image-to-video pipeline conditioned on a static reference image within the same scene as the target generation. This restricted, single-view perspective lacks sufficient temporal and expression cues, limiting the ability to synthesize high-fidelity talking avatars in customized backgrounds. To this end, we introduce Talking Avatar generation from Video Reference (TAVR), a novel framework that shifts the paradigm by leveraging cross-scene video inputs. To effectively process these extended temporal contexts and bridge cross-scene domain gaps, TAVR integrates a token selection module alongside a comprehensive three-stage training scheme. Specifically, same-scene video pretraining establishes foundational appearance copying, which is subsequently expanded by cross-scene reference fine-tuning for robust cross-scene adaptation. Finally, task-specific reinforcement learning aligns the generated outputs with human-centric rewards to maximize identity similarity. To systematically evaluate cross-scene robustness, we construct a new benchmark comprising 158 carefully curated cross-scene video pairs. Extensive experiments show that TAVR benefits from flexible inference-time video referencing and consistently surpasses existing baselines both quantitatively and qualitatively.

TAVR Teaser Figure

Talking avatar generation from video reference. (a) Visual comparisons show that our video reference framework preserves identity significantly better than the single-image baseline in cross-scene generation. The heatmaps reveal that our method selectively aggregates salient identity cues (e.g., lip shapes and facial silhouettes) from highly correlated frames, while naturally suppressing frames with mismatched poses and expressions. (b) The plot shows that our identity similarity increases with the number of reference frames, confirming that longer reference input is advantageous.
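To make the identity-similarity trend in (b) concrete, the following is a minimal sketch of one common way such a score can be computed: cosine similarity between face embeddings of generated frames and an averaged reference embedding. This page does not state which embedding model or aggregation is used, so the function names, the mean-pooling of reference frames, and the toy vectors are all assumptions for illustration only.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mean_embedding(embs):
    """Average a list of embeddings component-wise."""
    n = len(embs)
    return [sum(col) / n for col in zip(*embs)]

def identity_similarity(generated_embs, reference_embs):
    """Hypothetical score: mean cosine similarity of each generated-frame
    embedding to the mean reference embedding. The embeddings themselves
    would come from a face-recognition model, which is not modeled here."""
    ref_center = mean_embedding(reference_embs)
    sims = [cosine(g, ref_center) for g in generated_embs]
    return sum(sims) / len(sims)

# Toy 2-D vectors standing in for face embeddings (assumed data).
generated = [[1.0, 0.0], [1.0, 0.0]]
reference = [[1.0, 0.0], [0.0, 1.0]]
score = identity_similarity(generated, reference)
```

Under this assumed definition, adding reference frames changes the pooled reference embedding, which is one plausible way a score could vary with reference length.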

Framework

TAVR Framework Overview

Overview of the TAVR framework. Our framework generates high-fidelity talking avatars with customized backgrounds by integrating cross-scene video references. Visual inputs, including the video reference and masked target background, are encoded by the VAE into latents, followed by a Token Selection module to reduce computational redundancy. These tokens, alongside an optional motion latent for longer video synthesis, are concatenated with the noisy target latent and forwarded through adapted Transformer blocks. Within each block, a Reference Self-Attention module extends standard self-attention to jointly process target and reference features. Subsequently, two cross-attention modules inject guidance from the text prompt and audio signals. Notably, the audio module incorporates the corresponding reference audio to inject explicit audio-visual cues into the reference stream, guiding the network to accurately locate and extract the intrinsic speaking dynamics from the reference tokens.
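The token flow described above can be sketched in a few lines. This is a simplified illustration, not the TAVR implementation: the top-k selection rule, the precomputed `scores`, and the projection-free single-head attention are all assumptions, since the page does not specify how Token Selection scores tokens or how Reference Self-Attention is parameterized.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def select_tokens(ref_tokens, scores, k):
    """Hypothetical token selection: keep the k highest-scoring reference
    tokens, preserving their original order."""
    top = sorted(range(len(ref_tokens)), key=lambda i: scores[i], reverse=True)[:k]
    return [ref_tokens[i] for i in sorted(top)]

def reference_self_attention(target, reference):
    """Joint attention over the concatenated target and reference tokens
    (single head, no learned projections, for illustration only)."""
    joint = target + reference
    d = len(joint[0])
    out = []
    for q in joint:
        weights = softmax([dot(q, k) / math.sqrt(d) for k in joint])
        out.append([sum(w * v[j] for w, v in zip(weights, joint)) for j in range(d)])
    return out[:len(target)], out[len(target):]

# Toy tokens (assumed data): one noisy target token, three reference tokens.
target = [[1.0, 0.0]]
reference = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]
selected = select_tokens(reference, scores=[0.1, 0.9, 0.5], k=2)
target_out, reference_out = reference_self_attention(target, selected)
```

Dropping low-scoring reference tokens before attention shrinks the joint sequence, which is where the reduction in computational redundancy would come from in this sketch.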

Comparison with Baselines

We compare TAVR against state-of-the-art talking avatar generation methods: StableAvatar, EchoMimicV3, OmniAvatar, HuMo, and LongCat-Video-Avatar. As most baselines are designed for same-scene image referencing, we adapt them for cross-scene evaluation using two distinct testing protocols.

Paradigm I: Direct Cross-Scene Reference

The raw, unedited cross-scene reference image is provided directly to each baseline as the identity reference. Our method (TAVR) takes a reference video instead.

Example 1
Reference Image
Baseline Ref (Image)
Our Ref (Video)
A medium close-up shot frames a young East Asian man from the chest up, positioned in a warm, inviting indoor environment. He has neatly styled black hair and wears modern, clear-framed glasses, complementing his black t-shirt. A small, black lavalier microphone is discreetly clipped to his shirt collar. The man's gaze is directed slightly off-camera to the right, and his mouth moves articulately as he speaks, conveying a thoughtful and composed expression. Behind him, soft, light-colored curtains provide a gentle backdrop, with hints of upholstered furniture or a sofa visible, adding to the cozy atmosphere. The scene is illuminated by soft, warm artificial lighting, creating a pleasant glow that highlights his features and casts subtle, natural shadows. The aesthetic style is realistic and cinematic, presenting a clear and engaging portrait of the speaker.
StableAvatar
EchoMimicV3
OmniAvatar
HuMo
LongCat
Ours (TAVR)
Example 2
Reference Image
Baseline Ref (Image)
Our Ref (Video)
A medium shot captures a young woman with long, wavy brown hair and striking blue eyes, seated on a plush brown sofa. She wears a vibrant green, intricately crocheted top with a V-neck and a delicate necklace. Her expression is engaging and articulate as she speaks directly to the camera, her mouth forming various shapes to emphasize her words. Her hands move subtly to gesture, adding to her expressive communication. The background is a brightly lit, modern indoor space, featuring white walls, a glimpse of red curtains on the left, and blurred shelves or furniture in the distance. A green leafy plant is visible in the foreground on the left, adding a touch of nature. The lighting is soft and even, creating a clear, realistic cinematic aesthetic without harsh shadows, suggesting a well-lit daytime setting. She conveys a calm and confident demeanor, actively communicating with the viewer.
StableAvatar
EchoMimicV3
OmniAvatar
HuMo
LongCat
Ours (TAVR)

Paradigm II: Two-Stage Edited Reference

The cross-scene reference image is first contextually adapted to the target scene using Qwen-Image-Edit, and the edited image is subsequently fed to each baseline.
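The two protocols differ only in how the baseline's reference image is prepared, which can be sketched as below. The function name, the `"direct"`/`"edited"` labels, and the `edit_fn` callback are hypothetical; `edit_fn` merely stands in for an image-editing model such as Qwen-Image-Edit and does not reflect its actual interface.

```python
from typing import Callable, Optional

def prepare_baseline_reference(
    ref_image: str,
    paradigm: str,
    edit_fn: Optional[Callable[[str, str], str]] = None,
    target_scene_prompt: str = "",
) -> str:
    """Return the reference a baseline would receive under each protocol.

    "direct": Paradigm I, the raw cross-scene image is passed through.
    "edited": Paradigm II, the image is first adapted to the target scene
              by a caller-supplied editing function (a placeholder here).
    """
    if paradigm == "direct":
        return ref_image
    if paradigm == "edited":
        if edit_fn is None:
            raise ValueError("Paradigm II requires an editing function")
        return edit_fn(ref_image, target_scene_prompt)
    raise ValueError(f"unknown paradigm: {paradigm}")

# Stand-in editor used only to show the call pattern (not a real model).
def fake_edit(image: str, prompt: str) -> str:
    return f"edited({image}, {prompt})"

direct_ref = prepare_baseline_reference("ref.png", "direct")
edited_ref = prepare_baseline_reference("ref.png", "edited", fake_edit, "urban sidewalk")
```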

Example 1
Edited Reference Image
Baseline Ref (Edited)
Our Ref (Video)
A young woman with long blonde hair and a blue, textured top stands on a sunlit urban sidewalk, facing the camera directly. She holds a small, black, furry microphone in her right hand, positioned near her chest. Her expression is cheerful and engaging, with a wide smile as she speaks, her mouth moving clearly. Behind her, a tall black metal fence runs horizontally, separating the sidewalk from a large building. In the mid-ground, a historic-looking building with a prominent clock tower and a flag (green, white, and orange) flying from its roof is visible. Lush green trees line the street further back, and a few cars are parked or moving in the distance. The scene is bathed in bright, natural daylight, casting clear shadows and highlighting the woman's features. The overall aesthetic is a realistic, cinematic portrayal of a bright day in a city.
StableAvatar
EchoMimicV3
OmniAvatar
HuMo
LongCat
Ours (TAVR)
Example 2
Edited Reference Image
Baseline Ref (Edited)
Our Ref (Video)
A medium shot captures a man with short brown hair and a light grey t-shirt, positioned centrally, speaking directly into a black microphone. He stands in an indoor setting, likely a home office or studio, with dark walls and two wooden shelves behind him. The shelves display various items: a green potted plant, a framed map, a silver YouTube play button, a small white airplane model, a framed photo, and a stack of books. The man's facial expressions are animated and engaged, conveying enthusiasm as he articulates his words. He frequently uses expressive hand gestures, moving his hands up and down and outward to emphasize points. Bright, even lighting illuminates his face, creating a clear and professional look, while a subtle blue light emanates from the lower left, adding a touch of atmosphere to the dark background. The overall aesthetic is a realistic, cinematic style, focusing on clear communication.
StableAvatar
EchoMimicV3
OmniAvatar
HuMo
LongCat
Ours (TAVR)

Benchmark Showcases

Additional qualitative results on our benchmark demonstrate generalization across diverse identities, scenes, and motion patterns.

Showcase 1
Generated Reference
A medium close-up shot frames a blonde woman in a brightly lit, modern kitchen and living area. She wears a red and white tank top, her long, wavy hair cascading over her shoulders. She holds a smartphone in her left hand, initially looking down at the screen with her mouth slightly open as if speaking or reacting. The background features white kitchen cabinets, a coffee machine, and a glimpse of a living room with a window and a dark couch, all bathed in soft, natural daylight. As she speaks, her eyes widen, and she shifts her gaze directly towards the viewer, conveying an engaged and thoughtful expression. Her subtle head movements and direct eye contact create a sense of personal address. The scene maintains a realistic and cinematic aesthetic, capturing a candid, conversational moment.
Showcase 2
Generated Reference
A young woman with dark, wavy hair and a brown turtleneck sits in a cozy bedroom, captured in a medium close-up shot. Her expression is calm and conversational as she speaks directly to the camera, holding a small, black square device near her chest with both hands. The background reveals a light blue wall, a large wooden headboard with intricate carvings, and patterned white and green bedding. To her right, a dresser with a large oval mirror reflects a soft, diffused light, suggesting daytime. The overall lighting is even and gentle, creating a warm and inviting atmosphere. Her subtle head movements and hand gestures emphasize her words, conveying an engaged and informative tone. The aesthetic is realistic and cinematic, focusing on the subject within her personal space.
Showcase 3
Generated Reference
A medium shot captures a woman with long, wavy brown hair, wearing a simple white t-shirt and a black cord necklace with a gold ring pendant. She stands in a brightly lit indoor space, likely a room with light-colored walls. Behind her, a metal clothing rack holds several dark garments on black hangers, and to the right, a large mirror with an ornate gold frame is partially visible, reflecting a soft, diffused light. The woman is positioned centrally, looking directly at the camera. Her mouth moves as she speaks, and she uses subtle hand gestures to emphasize her points, raising and lowering her hands gently. Her expression is calm and conversational, conveying an engaged and informative tone. The lighting is soft and even, suggesting natural daylight, creating a clear and realistic cinematic aesthetic.
Showcase 4
Generated Reference
A young woman with medium brown hair and blue eyes is centered in a medium close-up shot, speaking directly to the camera. She wears a brown, black, and white plaid shirt over a black top, accessorized with silver hoop earrings and a delicate silver necklace. Her expression is calm and engaging, with subtle head nods and mouth movements as she articulates words. Her right hand occasionally gestures gently near her chest, while her left hand remains still. The background reveals a cozy indoor setting, likely a bedroom, with a white wall adorned with small, minimalist art pieces including a rainbow and cacti, and a white macrame hanging. A bed with a white and grey patterned duvet and a pink pillow is visible to the left. The scene is brightly and evenly lit, suggesting natural daylight, creating a clear and inviting atmosphere. The overall aesthetic is a realistic cinematic style.
Showcase 5
Generated Reference
A medium shot captures a man with a neatly trimmed beard and dark hair, wearing a plain black t-shirt, speaking directly to the camera. He stands in a brightly lit, minimalist room with white walls. Behind him, a large window with light-colored curtains allows soft, natural daylight to illuminate the scene. On the right side of the frame, a small wooden table holds a vibrant green plant in a red pot and a glowing orange Himalayan salt lamp, casting a warm, ambient light. Two framed, abstract art pieces hang on the wall further back. The man's expression is friendly and engaging, with a slight smile, as he articulates words and uses subtle hand gestures to emphasize his points. The overall mood is calm and inviting, presented in a realistic cinematic style with even, bright lighting.
Showcase 6
Generated Reference
A medium close-up shot frames a woman with long, wavy brown hair, wearing a textured, bright green sweater, as she sits and looks directly into the camera. The background is dimly lit, creating an intimate and focused atmosphere. Warm, soft light emanates from a lamp positioned to the left, casting a gentle glow on a hanging guitar, while a dark green sofa and a white cabinet are subtly visible on the right, receding into soft shadows. The woman speaks calmly and articulately, her mouth moving precisely, and her head occasionally shifts with subtle, natural movements. Her expression is composed and thoughtful, conveying a sense of earnestness and direct engagement with the viewer. The predominant warm lighting suggests an evening or indoor setting, with soft shadows adding depth and a realistic cinematic quality to the scene, fostering a mood of quiet concentration.
Showcase 7
Generated Reference
A young woman with short brown hair, wearing a cream-colored cardigan over a black top, sits in a cozy, festive indoor setting. To her left, a vibrant Christmas tree, adorned with twinkling lights and red and silver ornaments, creates a warm, inviting glow. On her right, a white bookshelf displays various decorative items, including a red-striped gnome and a Christmas stocking. She holds a small black microphone in her right hand and a dark-covered book titled "Heartless Hunter" in her left. Her expression is pleasant and engaging as she speaks directly to the camera, gesturing subtly with her hands. The medium shot, captured at eye-level, highlights her enthusiastic demeanor. The scene is bathed in soft, even indoor lighting, enhancing the cheerful holiday atmosphere with a realistic cinematic aesthetic.
Showcase 8
Generated Reference
A young woman with long, flowing brown hair sits comfortably in a light-colored armchair, positioned centrally in a medium shot. She wears a stylish black off-the-shoulder top and a delicate necklace. Her facial expressions are animated and engaging as she speaks directly to the camera, her mouth moving clearly. She frequently gestures with both hands, palms open, adding emphasis to her words. The background features a light blue wall adorned with a patterned quilt in shades of pink, grey, and white on the left. To the right, a wooden bookshelf filled with various books and decorative items creates a cozy, lived-in atmosphere. The lighting is soft and even, illuminating her face without harsh shadows, suggesting a bright indoor setting. Her demeanor is calm and conversational, conveying a sense of pleasant interaction. The overall aesthetic is a realistic, cinematic style.