Difference between QK and V features in SA-S modules:
Fig. B.1. '+ Van Gogh style': (a) source video; (b) w/o feature replacement; (c) w/ query and key feature replacement; (d) w/ value feature replacement + query and key as in (b).
Observation: The query and key features (in SA-S modules) dictate the spatial structure of the generated video, while the value features tend to influence the texture, including details such as color tones.
To understand why QK and V can be treated separately and how they differ, we visualize in Fig. B.1 the results of swapping different features (QK or V) in SA-S modules during a style transfer task on the source video. Compared to Fig. B.1(b), where no features are replaced, replacing QK (Fig. B.1(c)) makes the edited video adopt the same spatial structure as the source video. Meanwhile, replacing V (Fig. B.1(d), with QK kept as in (b)) removes the style information present in Fig. B.1(b), i.e., the texture details from the source video override the style described by the target prompt.
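To make this concrete, below is a minimal single-head sketch of the kind of feature swapping performed in SA-S modules. It assumes the source-branch features (`q_src`, `k_src`, `v_src`) have been cached at the same layer and denoising step; the function and argument names are ours for illustration, not those of the released implementation.

```python
import torch

def spatial_self_attention(q, k, v, q_src=None, k_src=None, v_src=None):
    """Single-head spatial self-attention over the tokens of one frame.

    q, k, v: features of the editing branch, shape (num_tokens, dim).
    q_src, k_src, v_src: optional features cached from the source-video branch.
    Passing q_src/k_src transplants the source spatial structure (Fig. B.1(c));
    passing v_src transplants the source texture/appearance (Fig. B.1(d)).
    """
    if q_src is not None and k_src is not None:  # structure (layout) injection
        q, k = q_src, k_src
    if v_src is not None:                        # appearance (texture) injection
        v = v_src
    attn = torch.softmax(q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5, dim=-1)
    return attn @ v
```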
Influence of Spatial Structure Control in Motion Editing:
Fig. B.2. 'playing guitar' --> 'eating an apple': (a) w/o spatial structure control; (b) w/ spatial structure control.
We explore the role of spatial structure control in motion editing. When the spatial control mechanism is removed from both the motion-reference branch and the main editing branch, the proposed method synthesizes videos with larger deviations from the source. We visualize the results in Fig. B.2; from left to right are the reconstruction branch, the main editing path, and the motion-reference branch, respectively. Although the motion-reference branch can still generate the target motion without spatial layout control, the structure deviates significantly; for example, the raccoon assumes a different pose and location. We regard this as a suboptimal solution because, compared to the results presented in the paper, the results without spatial structure control modify the object positions of the source video, reducing the consistency between the edited result and the source video.
We add the quantitative results below:
| Content Preservation | Motion Injection | Structure Control | Frame Similarity (↑) | Textual Alignment (↑) | Frame Consistency (↑) |
| --- | --- | --- | --- | --- | --- |
| - | - | - | 90.54 | 28.76 | 96.99 |
| ✓ | - | - | 97.28 | 29.95 | 98.12 |
| ✓ | ✓ | - | 91.30 | 31.48 | 98.08 |
| - | ✓ | ✓ | 96.11 | 31.37 | 98.12 |
| ✓ | ✓ | ✓ | 96.29 | 31.43 | 98.09 |
Appearance Editing (global):
Fig. C.1. (copied from Fig. 8 in the main text)
For appearance editing, we observe that changing the number of blending steps (\(t_1\)) in Eq. 4 effectively adjusts the degree to which the edited video remains faithful to the original. Taking the stylization in Fig. C.1 as an example, attention map injection over fewer steps (15) produces a stylized output that may not preserve the structure of the input, while injection over all 50 steps yields videos with nearly identical textures but a weaker style. Users can adjust the blended layers and steps to achieve their desired balance between stylization and faithfulness.
In practice, we empirically find that fixing these values, i.e., \(t_0=50, L=10\) (the same as MasaCtrl [1]) and \(t_1=25\), achieves good results in most cases. We further perform a quantitative study with different hyper-parameters (a sketch of how these values gate the injection is given after the tables below):
| Setting | Frame Similarity (↑) | Textual Alignment (↑) | Frame Consistency (↑) |
| --- | --- | --- | --- |
| \(t_0=20, L=10\) | 94.33 | 31.57 | 98.09 |
| \(t_0=50, L=10\) | 96.29 | 31.84 | 98.12 |
| \(t_0=50, L=8\) | 96.76 | 31.25 | 98.11 |
| Setting | Frame Similarity (↑) | Textual Alignment (↑) | Frame Consistency (↑) |
| --- | --- | --- | --- |
| \(t_1=20\) | 96.21 | 30.92 | 98.06 |
| \(t_1=25\) | 96.29 | 31.43 | 98.09 |
| \(t_1=30\) | 96.50 | 31.04 | 98.08 |
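As referenced above, the gating implied by these hyper-parameters can be summarized as follows. This is a hedged sketch of our reading (feature injection during the first \(t_0\) of the 50 denoising steps on layers with index \(\ge L\), and the blending of Eq. 4 during the first \(t_1\) steps); the exact conditions in the implementation may differ.

```python
def injection_gate(step, layer, t0=50, t1=25, L=10, num_layers=16):
    """Decide, for a given (denoising step, U-Net layer), whether to inject
    cached attention features and whether to apply the Eq. 4 blending.

    Assumptions of this sketch: steps are counted from 0 to 49, layers from
    shallow (0) to deep (num_layers - 1), and deeper layers (>= L) are injected.
    """
    inject_features = (step < t0) and (L <= layer < num_layers)
    blend_attention = step < t1
    return inject_features, blend_attention

# With the defaults (t0=50, L=10, t1=25): injection is active in deep layers at
# every step, while blending is applied only during the first half of the steps.
print(injection_gate(step=30, layer=12))  # (True, False)
```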
Visualization of attention maps in CA-S modules:
Visualization of masks:
Columns (left to right): output video; CA-S mask (obtained by thresholding the attention map of 'man'); SAM mask (obtained by point-guided segmentation). The second example edits 'cat' --> 'dog'.
To investigate the impact of mask-guided coordination, we begin by visualizing masks obtained from 1) the attention maps in CA-S modules and 2) the off-the-shelf segmentation model SAM [1], and then present both qualitative and quantitative results of UniEdit with and without mask-guided coordination.
As verified by previous work [2], the attention maps in CA-S modules contain correspondence information between text and visual features. The underlying intuition is that the attention value between a word token and the spatial feature at location (i, j) indicates how similar that token is to the spatial feature at that location. We visualize the text-image cross-attention map alongside the synthesized video in Section D. The attention maps exhibit spatial correspondences that align with the video output; for instance, areas with higher values for the tokens 'man' and 'NYC' correspond to the foreground and background, respectively. We further apply a fixed threshold (0.4 in practice) to derive binary segmentation maps from the attention maps. For comparison, we also display the segmentation mask obtained by prompting SAM with points. We observe that the cross-attention mask is generally accurate and can serve as a reliable proxy when an external segmentor is not available.
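As an illustration, the binary mask could be derived from the CA-S attention maps roughly as below. This is a minimal sketch that assumes head-averaged attention probabilities on a square latent grid; the averaging protocol and the min-max normalization are our assumptions, not necessarily the exact procedure used.

```python
import torch

def token_mask_from_cross_attention(attn_maps, token_index, threshold=0.4):
    """Derive a binary mask for one prompt token from CA-S attention maps.

    attn_maps:   cross-attention probabilities of shape (heads, H*W, num_tokens).
    token_index: index of the word of interest (e.g. 'man') among the prompt tokens.
    threshold:   fixed cutoff applied after normalizing the map to [0, 1].
    """
    spatial = attn_maps.mean(dim=0)[:, token_index]                # (H*W,)
    spatial = (spatial - spatial.min()) / (spatial.max() - spatial.min() + 1e-8)
    side = int(spatial.numel() ** 0.5)                             # assume a square grid
    return spatial.reshape(side, side) > threshold                 # boolean (H, W) mask
```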
We then examine the impact of mask-guided coordination through both qualitative and quantitative results across four settings: {w/o UniEdit, UniEdit w/o mask, UniEdit with mask from CA-S, UniEdit with mask from SAM}. Qualitatively, UniEdit significantly enhances the consistency between the edited videos and the original video, and the mask-guided coordination technique further improves the consistency of unedited areas (e.g., color and texture). The quantitative results above are consistent with this analysis.
[1] Kirillov, Alexander, et al. "Segment anything."
[2] Hertz, Amir, et al. "Prompt-to-Prompt Image Editing with Cross-Attention Control."
Fig. E.1. Per-branch outputs for 'playing guitar' --> 'waving' and 'walking' --> 'lying'.
The output of each branch is visualized in Section E, Fig. E.1. The motion-reference branch (right) generates a video with the target motion and effectively transfers it to the main editing path (middle); meanwhile, the main path inherits the content from the reconstruction branch (left), which enhances the consistency of the unedited parts.
Fig. E.2. Two examples: source video, reconstruction branch output, and reconstruction branch output w/ null-text inversion.
| Inversion | FID (↓) | LPIPS (↓) | PSNR (↑) |
| --- | --- | --- | --- |
| DDIM | 21.30 | 0.140 | 34.26 |
| DDIM + Null-text inversion [4] | 17.81 | 0.158 | 33.75 |
As shown, the reconstruction branch faithfully reconstructs the source video; it therefore retains the content of the source video and can be leveraged for content preservation during editing.
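For completeness, the deterministic DDIM inversion that the reconstruction branch builds on can be sketched as follows. Here `eps_model` is a placeholder name for the base model's noise predictor, and the null-text inversion variant [4], which additionally optimizes the unconditional embedding at each step, is omitted; this is a generic sketch rather than the paper's exact implementation.

```python
import torch

@torch.no_grad()
def ddim_invert(latents, text_emb, eps_model, alphas_cumprod, timesteps):
    """Invert clean video latents back to noise along the DDIM trajectory.

    latents:        clean latents of the source video (z_0).
    eps_model:      hypothetical noise predictor, called as eps_model(z, t, text_emb).
    alphas_cumprod: cumulative alpha schedule, indexed by timestep.
    timesteps:      DDIM schedule ordered from low noise to high noise.
    """
    z = latents
    trajectory = [z]
    for i, t in enumerate(timesteps[:-1]):
        t_next = timesteps[i + 1]
        a_t, a_next = alphas_cumprod[t], alphas_cumprod[t_next]
        eps = eps_model(z, t, text_emb)
        # Predict the clean latent, then re-noise it to the next (higher) noise level.
        z0_pred = (z - (1 - a_t).sqrt() * eps) / a_t.sqrt()
        z = a_next.sqrt() * z0_pred + (1 - a_next).sqrt() * eps
        trajectory.append(z)
    return z, trajectory
```

The final noise (and, optionally, the intermediate latents) can then serve as the starting point for the reconstruction branch during editing.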
Visualization of foreground mask extracted by \(U^2\)-Net:
Fig. E.3.
cat --> dog, + Van Gogh style
raccoon --> panda, play violin --> waving
raccoon --> kangaroo, play violin --> play guitar
(the model does not learn the correct posture for playing the violin; with or without our method, it cannot generate the 'violin on the shoulder' posture)
a shark on the left and several goldfish swim in a tank
We exhibit failure cases in Section F. Rows 1-3 show cases where multiple elements are edited simultaneously, and we observe a relatively large inconsistency with the source video. A naive solution is to perform editing with UniEdit multiple times, once per element. Row 4 visualizes the results of editing a video with a complex scene; the model sometimes fails to understand the semantics of the target prompt, resulting in incorrect editing. This may be caused by the base model's limited text-understanding capability, as discussed in [1]. It could be alleviated by leveraging the reasoning power of MLLMs [1] or by adapting approaches for editing in complex scenarios [2].
[1] Huang, Yuzhou, et al. "SmartEdit: Exploring Complex Instruction-based Image Editing with Multimodal Large Language Models."
[2] Mao, Qi, et al. "MAG-Edit: Localized Image Editing in Complex Scenarios via Mask-Based Attention-Adjusted Guidance."