VRWKV-Editor:

Reducing Quadratic Complexity in Transformer-Based Video Editing


Abdelilah Aitrouga1,*, Youssef Hmamouche1, Amal El Fallah Seghrouchni1,2

1Ai movement - International Artificial Intelligence Center of Morocco
2Sorbonne Université, LIP6 - UMR 7606 CNRS, France


VRWKV-Editor leverages natural language instructions to modify, enhance, or transform video content by fine-tuning a pre-trained text-to-image diffusion model for text-to-video generation.

Abstract

In light of recent progress in video editing, deep learning models that capture both spatial and temporal dependencies have emerged as the primary approach. However, these models suffer from the quadratic computational complexity of traditional attention mechanisms, making them difficult to adapt to long-duration and high-resolution videos. This limitation restricts their applicability in practical contexts such as real-time video processing. To tackle this challenge, we introduce a method to reduce both the time and space complexity of these systems by proposing VRWKV-Editor, a novel video editing model that integrates a linear spatio-temporal aggregation module into video-based diffusion models. VRWKV-Editor leverages the bidirectional weighted key-value recurrence mechanism of the RWKV architecture to capture global dependencies while preserving temporal coherence, achieving linear complexity without sacrificing quality. Extensive experiments demonstrate that the proposed method achieves up to 3.7× speedup and 60% lower memory usage compared to state-of-the-art diffusion-based video editing methods, while maintaining competitive performance in frame consistency and text alignment. Furthermore, a comparative analysis we conducted on videos of different sequence lengths confirms that the gap in editing speed between our approach and self-attention architectures widens as videos grow longer.
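To illustrate why the weighted key-value (WKV) recurrence scales linearly, the sketch below implements a simplified, single-head version of the recurrence in NumPy and a bidirectional variant that averages a forward and a backward pass. This is an illustrative toy, not the paper's actual module: the real model uses learned per-channel decays, token shifting, and a fused kernel; the names `wkv_forward` and `bi_wkv` are ours.

```python
import numpy as np

def wkv_forward(k, v, w, u):
    """Linear-time WKV recurrence (simplified, single head).

    k, v : (T, C) key and value sequences
    w    : (C,) per-channel decay (typically negative)
    u    : (C,) per-channel bonus weight for the current token
    """
    T, C = k.shape
    out = np.zeros_like(v, dtype=float)
    a = np.zeros(C)  # running exp-weighted sum of past values
    b = np.zeros(C)  # running sum of the corresponding weights
    for t in range(T):
        # the current token receives an extra "bonus" weight exp(u + k_t)
        cur = np.exp(u + k[t])
        out[t] = (a + cur * v[t]) / (b + cur)
        # decay the past state, then absorb the current token: O(C) per step
        a = np.exp(w) * a + np.exp(k[t]) * v[t]
        b = np.exp(w) * b + np.exp(k[t])
    return out

def bi_wkv(k, v, w, u):
    # bidirectional variant: run the recurrence in both directions and average,
    # so every token aggregates global context in O(T) total time
    fwd = wkv_forward(k, v, w, u)
    bwd = wkv_forward(k[::-1], v[::-1], w, u)[::-1]
    return 0.5 * (fwd + bwd)
```

Each timestep updates a fixed-size state `(a, b)` instead of attending to all previous tokens, which is the source of the linear (rather than quadratic) cost in sequence length.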

Approach



Pipeline of VRWKV-Editor. Given a text–video pair (e.g., “A man with a backpack hikes on a rocky terrain”) as input, our method leverages pretrained text-to-image diffusion models for text-to-video generation. The input video is first encoded into a discrete latent space, after which our U-Net architecture predicts the injected noise. During inference, a novel video is synthesized by inverting the discrete noise from the input video, guided by an edited prompt (e.g., “An astronaut with a jetpack floats above a Martian landscape, with red rocky terrains and tall, alien-like mountains in the backdrop”).
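The noise inversion mentioned above follows the standard deterministic DDIM scheme. A schematic NumPy version is sketched below, where `abar_t` and `abar_next` stand in for the cumulative noise-schedule values (ᾱ) and `eps` for the U-Net's noise prediction; the function names are ours, not from the released code.

```python
import numpy as np

def ddim_invert_step(x_t, eps, abar_t, abar_next):
    """One deterministic DDIM inversion step: latent at timestep t -> t+1.

    x_t       : current latent (any array shape)
    eps       : noise predicted by the denoiser at x_t
    abar_t    : cumulative schedule value at t   (scalar in (0, 1])
    abar_next : cumulative schedule value at t+1 (scalar, < abar_t)
    """
    # estimate the clean latent x0 implied by the noise prediction
    x0 = (x_t - np.sqrt(1.0 - abar_t) * eps) / np.sqrt(abar_t)
    # re-noise deterministically toward the noisier timestep t+1
    return np.sqrt(abar_next) * x0 + np.sqrt(1.0 - abar_next) * eps

def ddim_denoise_step(x_next, eps, abar_t, abar_next):
    # the matching reverse step (t+1 -> t), which exactly undoes the
    # inversion when the same noise prediction eps is used
    x0 = (x_next - np.sqrt(1.0 - abar_next) * eps) / np.sqrt(abar_next)
    return np.sqrt(abar_t) * x0 + np.sqrt(1.0 - abar_t) * eps
```

During editing, the denoising pass replaces `eps` with predictions conditioned on the edited prompt, so the synthesized video keeps the source's structure while following the new text.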

In the following results, we showcase demos of our framework in various scenarios: Foreground, Background, Style, and Global Editing.


🌟 Foreground Transfer 🌟


Foreground editing enables customized changes to foreground objects.


Input Video "A golden retriever stands alert in a forested area, its gaze fixed intently ahead."
Input Video "Two quadrotor drones swim in the blue ocean on a coral reef."
Input Video "Drone flyover of the Canadian National Tower."
Input Video "A dog in the grass under the sun."
Input Video "A white woman is laughing." Input Video "Spider-Man is driving a motorbike at high speed in the forest."
Input Video "The Canadian flag on a flagpole moves in the wind." Input Video "Several sharks swim in a tank."

🌟 Background Editing 🌟


Background editing enables customized background modification and replacement.


Input Video "An aircraft carrier at the dock with planes on its deck, presented in a grayscale wartime documentary style."
Input Video "A Jeep turns in the snow."
Input Video "A beautiful lotus with New York City in the background."
Input Video "Wind turbines spin during dusk, sunset." Input Video "A fishing boat sails on the tranquil surface of a moonlit lake, surrounded by towering mountains."
Input Video "A man is driving a motorbike at high speed in the snow." Input Video "A man with a backpack hikes on a lunar surface, surrounded by vast craters and the vastness of space."

🌟 Style Editing 🌟


Style editing enables users to customize how structure is inherited from the source video to the target video across different styles.


Input Video "A black swan swimming in a pond with lush greenery in the background, oil painting style." Input Video "A cruise ship sailing through the ocean with a city skyline in the background, in Studio Ghibli style."
Input Video "An empty swing hanging from chains, with fog obscuring trees in the background, Gothic Animation style." Input Video "Steampunk adventurers traveling in a retro-futuristic vehicle through an autumnal forest."
Input Video "a jeep car is moving on the road, cartoon style." Input Video "A large airplane on a wet runway under a twilight sky, all rendered in a somber grayscale tone."
Input Video "A playful corgi dog with its mouth open and tongue out, looking excitedly at the camera, rendered in a sketch style." Input Video "A close-up of exotic, luminous flowers, bioluminescent with hues of neon blue and green."

Comparison


All models employing Stable Diffusion use version 1.4. The default settings provided in their official codebases are used.



Original Prompt: A man with a backpack hikes on a rocky terrain, surrounded by tall, rugged mountains and scattered boulders.

Target Prompt: An astronaut with a jetpack floats above a Martian landscape, with red rocky terrains and tall, alien-like mountains in the backdrop.

Input Video ControlVideo FateZero Pix2Video Rerender A Video
Text2Video-Zero CCEdit Tune-A-Video vid2vid-zero VRWKV-Editor (Ours)

Original Prompt: A rider on a horse jumping over an obstacle in an equestrian competition with a clear sky and other obstacles in the background.


Target Prompt: A rider on a horse jumping over an obstacle in an equestrian competition, rendered in Van Gogh style with swirling skies and vibrant colors.

Input Video ControlVideo FateZero Pix2Video Rerender A Video
Text2Video-Zero CCEdit Tune-A-Video vid2vid-zero VRWKV-Editor (Ours)

Original Prompt: A butterfly with black and orange wings perches on a plant amidst a field of golden grass.


Target Prompt: A dragonfly with shimmering wings perches on a plant amidst a field of golden grass.

Input Video ControlVideo FateZero Pix2Video Rerender A Video
Text2Video-Zero CCEdit Tune-A-Video vid2vid-zero VRWKV-Editor (Ours)

Original Prompt: Two individuals crossing a street at a railway intersection with buildings in the background.


Target Prompt: Two animated characters from a classic video game crossing a pixelated street, with a digitalized cityscape in the background.

Input Video ControlVideo FateZero Pix2Video Rerender A Video
Text2Video-Zero CCEdit Tune-A-Video vid2vid-zero VRWKV-Editor (Ours)

Original Prompt: A close-up of daisies with vibrant yellow centers and white petals.


Target Prompt: A close-up of daisies with vibrant yellow centers and white petals, vibrant strokes of an impressionist painting.

Input Video ControlVideo FateZero Pix2Video Rerender A Video
Text2Video-Zero CCEdit Tune-A-Video vid2vid-zero VRWKV-Editor (Ours)

Original Prompt: A race car performing a drift turn on a track.


Target Prompt: A race car drifting on a track in a grainy, high-contrast black and white film style.

Input Video ControlVideo FateZero Pix2Video Rerender A Video
Text2Video-Zero CCEdit Tune-A-Video vid2vid-zero VRWKV-Editor (Ours)

BibTeX

@misc{aitrouga2025vrwkveditorreducingquadraticcomplexity,
      title={VRWKV-Editor: Reducing quadratic complexity in transformer-based video editing}, 
      author={Abdelilah Aitrouga and Youssef Hmamouche and Amal El Fallah Seghrouchni},
      year={2025},
      eprint={2509.25998},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2509.25998}, 
}