Swin-Editor: Enhancing Creativity and Maintaining Consistency in Text-Driven Video Editing


Swin-Editor leverages natural language instructions to modify, enhance, or transform video content by fine-tuning a pre-trained text-to-image diffusion model for text-to-video generation.

Abstract

Large visual models have recently made considerable progress in Text-to-Video generation thanks to the development of foundation models and multi-modal alignment techniques, making generated videos increasingly realistic. Current approaches predominantly rely on adapting image-based diffusion models via spatiotemporal attention, but this generally leads to temporal inconsistency and increased model complexity. The inconsistency arises mainly because those approaches are built on models originally designed for image generation, which do not inherently account for the spatiotemporal structure of videos.

In this article, we introduce Swin-Editor, an efficient approach to text-instructed video editing that extends a diffusion-based Text-to-Image model to Text-to-Video.

Specifically, we focus on enhancing the visual quality of the generated videos by incorporating a spatiotemporally factorized video prediction mechanism into the diffusion model. Additionally, to reduce computational complexity and memory requirements, the proposed model includes a Vector Quantized Variational Autoencoder (VQ-VAE) module that quantizes and compresses the spatiotemporal latent features. Compared with state-of-the-art models, the proposed architecture achieves a good trade-off across multiple evaluation metrics in various scenarios.
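To make the quantization step concrete, here is a minimal sketch of the nearest-codebook lookup at the core of a VQ-VAE. It is an illustration under stated assumptions, not the Swin-Editor implementation: the function names `squared_distance` and `quantize`, the 2-dimensional latents, and the 4-entry codebook are all hypothetical, and a real VQ-VAE learns both the encoder and the codebook.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equally sized vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(latents: List[Vector], codebook: List[Vector]) -> Tuple[List[int], List[Vector]]:
    """Replace every latent vector by its nearest codebook entry.

    Returns the integer indices (the compressed representation) and the
    corresponding quantized vectors.
    """
    indices = [
        min(range(len(codebook)), key=lambda k: squared_distance(z, codebook[k]))
        for z in latents
    ]
    quantized = [codebook[i] for i in indices]
    return indices, quantized


# Hypothetical codebook: 4 entries of dimension 2.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

# Hypothetical flattened spatiotemporal latents, one vector per
# (frame, row, column) position of the encoded video.
latents = [(0.1, -0.2), (0.9, 0.2), (0.2, 0.8), (0.7, 0.9)]

indices, quantized = quantize(latents, codebook)
print(indices)    # [0, 1, 2, 3]
print(quantized)  # the codebook entries selected for each position
```

The memory saving comes from storing one small integer per position instead of a full floating-point vector; the codebook itself is shared across all positions and frames.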

Approach


Pipeline of Swin-Editor. Given a text-video pair (e.g., “kite-surfer in the ocean”) as input, our method leverages a pretrained T2I diffusion model for T2V generation: the input video is encoded into a discrete space, and our U-Net architecture predicts the added noise. During inference, we sample a novel video from the discrete noise inverted from the input video, guided by an edited prompt (e.g., “kite-surfer in the ocean at sunset”).
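The invert-then-sample idea in the pipeline can be sketched with a deterministic DDIM-style update. This is a sketch under loud assumptions: the page does not state which inversion scheme is used, so DDIM is assumed here; the latent is a single scalar standing in for the encoded video; and `toy_predictor` is a hypothetical stand-in for the U-Net that depends only on the prompt. The names `alpha_bars`, `ddim_move`, `invert`, and `sample` are illustrative as well.

```python
import math
from typing import Callable, List

NoisePredictor = Callable[[float, int, str], float]


def alpha_bars(num_steps: int, beta_start: float = 1e-4, beta_end: float = 2e-2) -> List[float]:
    """Cumulative products of (1 - beta) for a linear schedule; index 0 means no noise."""
    bars = [1.0]
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / max(num_steps - 1, 1)
        bars.append(bars[-1] * (1.0 - beta))
    return bars


def ddim_move(x: float, eps: float, a_from: float, a_to: float) -> float:
    """Deterministic DDIM update between two noise levels given a noise estimate."""
    x0_pred = (x - math.sqrt(1.0 - a_from) * eps) / math.sqrt(a_from)
    return math.sqrt(a_to) * x0_pred + math.sqrt(1.0 - a_to) * eps


def invert(x0: float, predictor: NoisePredictor, prompt: str, bars: List[float]) -> float:
    """Walk from the clean latent to a noisy latent (inversion)."""
    x = x0
    for t in range(len(bars) - 1):
        x = ddim_move(x, predictor(x, t, prompt), bars[t], bars[t + 1])
    return x


def sample(x_noisy: float, predictor: NoisePredictor, prompt: str, bars: List[float]) -> float:
    """Walk from the noisy latent back to a clean latent (sampling)."""
    x = x_noisy
    for t in range(len(bars) - 1, 0, -1):
        x = ddim_move(x, predictor(x, t, prompt), bars[t], bars[t - 1])
    return x


def toy_predictor(x: float, t: int, prompt: str) -> float:
    """Hypothetical noise predictor: ignores x and t, depends only on the prompt."""
    return (sum(ord(c) for c in prompt) % 100) / 100.0


bars = alpha_bars(50)
source_prompt = "kite-surfer in the ocean"
edited_prompt = "kite-surfer in the ocean at sunset"

x0 = 0.5  # scalar stand-in for the encoded input video
x_noisy = invert(x0, toy_predictor, source_prompt, bars)
reconstruction = sample(x_noisy, toy_predictor, source_prompt, bars)
edited = sample(x_noisy, toy_predictor, edited_prompt, bars)

print(abs(reconstruction - x0))  # close to zero: same prompt recovers the input
print(abs(edited - x0))          # clearly non-zero: the edited prompt changes the result
```

With this toy predictor the round trip is exact up to floating-point error. A real noise predictor also depends on the latent and the timestep, so inversion is only approximate in practice.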

The results below showcase demos of our framework in various scenarios, such as foreground, background, style, and global editing.


🌟 Foreground Transfer 🌟


Foreground editing enables customized changes to foreground objects.


Input Video "A golden retriever stands alert in a forested area, its gaze fixed intently ahead."
Input Video "Two quadrotor drones swim in the blue ocean on a coral reef."
Input Video "Drone flyover of the Canadian National Tower."
Input Video "A dog in the grass under the sun."
Input Video "A white woman is laughing."
Input Video "Spider-Man is driving speedly a motorbike in the forest."
Input Video "The Canadian flag on a flagpole moves in the wind."
Input Video "Several sharks swim in a tank."

🌟 Background Editing 🌟


Background editing enables customized modification and replacement of the background.


Input Video "An aircraft carrier at the dock with planes on its deck, presented in a grayscale wartime documentary style."
Input Video "Jeep car turn in the snow."
Input Video "A beautiful lotus with New York City in the background."
Input Video "Wind turbines spin during dusk, sunset."
Input Video "A fishing boat sails on the tranquil surface of a moonlit lake, surrounded by towering mountains."
Input Video "a man is driving speedly a motorbike in the snow."
Input Video "A man with a backpack hikes on a lunar surface, surrounded by vast craters and the vastness of space."

🌟 Style Editing 🌟


Style editing lets users render the target video in a different style while inheriting the structure of the source video.


Input Video "A black swan swimming in a pond with lush greenery in the background, oil painting style."
Input Video "A cruise ship sailing through the ocean with a city skyline in the background, in Studio Ghibli style."
Input Video "An empty swing hanging from chains, with fog obscuring trees in the background, Gothic Animation style."
Input Video "Steampunk adventurers traveling in a retro-futuristic vehicle through an autumnal forest."
Input Video "a jeep car is moving on the road, cartoon style."
Input Video "A large airplane on a wet runway under a twilight sky, all rendered in a somber grayscale tone."
Input Video "A playful corgi dog with its mouth open and tongue out, looking excitedly at the camera, rendered in a sketch style."
Input Video "A close-up of exotic, luminous flowers, bioluminescent with hues of neon blue and green."

Comparison


All models that employ Stable Diffusion use version 1.4, and each method is run with the default settings provided in its official codebase.



Original Prompt: A man with a backpack hikes on a rocky terrain, surrounded by tall, rugged mountains and scattered boulders.

Target Prompt: An astronaut with a jetpack floats above a Martian landscape, with red rocky terrains and tall, alien-like mountains in the backdrop.

Input Video, ControlVideo, FateZero, Pix2Video, Rerender A Video
Text2Video-Zero, CCEdit, Tune-A-Video, vid2vid-zero, Swin-Editor (Ours)

Original Prompt: A rider on a horse jumping over an obstacle in an equestrian competition with a clear sky and other obstacles in the background.


Target Prompt: A rider on a horse jumping over an obstacle in an equestrian competition, rendered in Van Gogh style with swirling skies and vibrant colors.

Input Video, ControlVideo, FateZero, Pix2Video, Rerender A Video
Text2Video-Zero, CCEdit, Tune-A-Video, vid2vid-zero, Swin-Editor (Ours)

Original Prompt: A butterfly with black and orange wings perches on a plant amidst a field of golden grass.


Target Prompt: A dragonfly with shimmering wings perches on a plant amidst a field of golden grass.

Input Video, ControlVideo, FateZero, Pix2Video, Rerender A Video
Text2Video-Zero, CCEdit, Tune-A-Video, vid2vid-zero, Swin-Editor (Ours)

Original Prompt: Two individuals crossing a street at a railway intersection with buildings in the background.


Target Prompt: Two animated characters from a classic video game crossing a pixelated street, with a digitalized cityscape in the background.

Input Video, ControlVideo, FateZero, Pix2Video, Rerender A Video
Text2Video-Zero, CCEdit, Tune-A-Video, vid2vid-zero, Swin-Editor (Ours)

Original Prompt: A close-up of daisies with vibrant yellow centers and white petals.


Target Prompt: A close-up of daisies with vibrant yellow centers and white petals, vibrant strokes of an impressionist painting.

Input Video, ControlVideo, FateZero, Pix2Video, Rerender A Video
Text2Video-Zero, CCEdit, Tune-A-Video, vid2vid-zero, Swin-Editor (Ours)

Original Prompt: A race car performing a drift turn on a track.


Target Prompt: A race car drifting on a track in a grainy, high-contrast black and white film style.

Input Video, ControlVideo, FateZero, Pix2Video, Rerender A Video
Text2Video-Zero, CCEdit, Tune-A-Video, vid2vid-zero, Swin-Editor (Ours)