Difference between QK and V features in SA-S modules:
Fig. B.1. '+ Van Gogh style': (a) source video; (b) w/o feature replacement; (c) w/ query and key feature replacement; (d) w/ value feature replacement + query and key as in (b).
Observation: The query and key features (in SA-S modules) dictate the spatial structure of the generated video, while the value features tend to influence the texture, including details such as color tones.
To understand why QK and V can be treated separately and how they differ, we visualize in Fig. B.1 the results of swapping different features (QK or V) in SA-S modules during a style transfer task on the source video. Compared to Fig. B.1(b), where no features are replaced, replacing QK (Fig. B.1(c)) makes the edited video adopt the same spatial structure as the source video. Meanwhile, replacing V (Fig. B.1(d), with QK kept as in (b)) removes the style information present in Fig. B.1(b), i.e., the texture details from the source video override the style described by the target prompt.
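To make this concrete, below is a minimal single-head sketch of the kind of feature swapping performed in SA-S modules. It assumes the source-branch features (`q_src`, `k_src`, `v_src`) have been cached at the same layer and denoising step; the function and argument names are ours for illustration, not those of the released implementation.

```python
import torch

def spatial_self_attention(q, k, v, q_src=None, k_src=None, v_src=None):
    """Single-head spatial self-attention over the tokens of one frame.

    q, k, v: features of the editing branch, shape (num_tokens, dim).
    q_src, k_src, v_src: optional features cached from the source-video branch.
    Passing q_src/k_src transplants the source spatial structure (Fig. B.1(c));
    passing v_src transplants the source texture/appearance (Fig. B.1(d)).
    """
    if q_src is not None and k_src is not None:  # structure (layout) injection
        q, k = q_src, k_src
    if v_src is not None:                        # appearance (texture) injection
        v = v_src
    attn = torch.softmax(q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5, dim=-1)
    return attn @ v
```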
Influence of Spatial Structure Control in Motion Editing:
Fig. B.2. 'playing guitar' --> 'eating an apple': (a) w/o spatial structure control; (b) w/ spatial structure control.
We explore the role of spatial structure control in motion editing. When the spatial control mechanism is removed from both the motion-reference branch and the main editing branch, the proposed method synthesizes videos with larger deviations from the source. We visualize the results in Fig. B.2; from left to right are the reconstruction branch, the main editing path, and the motion-reference branch, respectively. Although the motion-reference branch can still generate the target motion without spatial layout control, the structure deviates significantly; for example, the raccoon assumes a different pose and location. We regard this as a suboptimal solution because, compared to the results presented in the paper, the results without spatial structure control modify the object positions of the source video, reducing the consistency between the edited result and the source video.
We add the quantitative results below:
| Content Preservation | Motion Injection | Structure Control | Frame Similarity (↑) | Textual Alignment (↑) | Frame Consistency (↑) |
| --- | --- | --- | --- | --- | --- |
| - | - | - | 90.54 | 28.76 | 96.99 |
| ✓ | - | - | 97.28 | 29.95 | 98.12 |
| ✓ | ✓ | - | 91.30 | 31.48 | 98.08 |
| - | ✓ | ✓ | 96.11 | 31.37 | 98.12 |
| ✓ | ✓ | ✓ | 96.29 | 31.43 | 98.09 |
Appearance Editing (global):
Fig. C.1. (copied from Fig. 8 in the main text)
For appearance editing, we observe that changing the number of blending steps (\(t_1\)) in Eq. 4 effectively adjusts the degree to which the edited video remains faithful to the original. Taking the stylization in Fig. C.1 as an example, attention map injection over fewer steps (15) produces a stylized output that may not preserve the structure of the input, while injection over all 50 steps yields videos with nearly identical textures but a weaker style. Users can adjust the blended layers and steps to achieve their desired balance between stylization and faithfulness.
In practice, we empirically find that fixing these values, i.e., \(t_0=50, L=10\) (the same as MasaCtrl [1]) and \(t_1=25\), achieves good results in most cases. We further perform a quantitative study with different hyper-parameters (a sketch of how these values gate the injection is given after the tables below):
| Setting | Frame Similarity (↑) | Textual Alignment (↑) | Frame Consistency (↑) |
| --- | --- | --- | --- |
| \(t_0=20, L=10\) | 94.33 | 31.57 | 98.09 |
| \(t_0=50, L=10\) | 96.29 | 31.84 | 98.12 |
| \(t_0=50, L=8\) | 96.76 | 31.25 | 98.11 |
| Setting | Frame Similarity (↑) | Textual Alignment (↑) | Frame Consistency (↑) |
| --- | --- | --- | --- |
| \(t_1=20\) | 96.21 | 30.92 | 98.06 |
| \(t_1=25\) | 96.29 | 31.43 | 98.09 |
| \(t_1=30\) | 96.50 | 31.04 | 98.08 |
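As referenced above, the gating implied by these hyper-parameters can be summarized as follows. This is a hedged sketch of our reading (feature injection during the first \(t_0\) of the 50 denoising steps on layers with index \(\ge L\), and the blending of Eq. 4 during the first \(t_1\) steps); the exact conditions in the implementation may differ.

```python
def injection_gate(step, layer, t0=50, t1=25, L=10, num_layers=16):
    """Decide, for a given (denoising step, U-Net layer), whether to inject
    cached attention features and whether to apply the Eq. 4 blending.

    Assumptions of this sketch: steps are counted from 0 to 49, layers from
    shallow (0) to deep (num_layers - 1), and deeper layers (>= L) are injected.
    """
    inject_features = (step < t0) and (L <= layer < num_layers)
    blend_attention = step < t1
    return inject_features, blend_attention

# With the defaults (t0=50, L=10, t1=25): injection is active in deep layers at
# every step, while blending is applied only during the first half of the steps.
print(injection_gate(step=30, layer=12))  # (True, False)
```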
Visualization of attention maps in CA-S modules:
Visualization of masks:
Columns (left to right): output video; CA-S mask (obtained by thresholding the attention map of 'man'); SAM mask (obtained by point-guided segmentation). The second example edits 'cat' --> 'dog'.
To investigate the impact of mask-guided coordination, we begin by visualizing masks obtained from 1) the attention maps in CA-S modules and 2) the off-the-shelf segmentation model SAM [1], and then present both qualitative and quantitative results of UniEdit with and without mask-guided coordination.
As verified by previous work [2], the attention maps in CA-S modules contain correspondence information between text and visual features. The underlying intuition is that the attention value between a word token and the spatial feature at location (i, j) indicates how similar that token is to the spatial feature at that location. We visualize the text-image cross-attention map alongside the synthesized video in Section D. The attention maps exhibit spatial correspondences that align with the video output; for instance, areas with higher values for the tokens 'man' and 'NYC' correspond to the foreground and background, respectively. We further apply a fixed threshold (0.4 in practice) to derive binary segmentation maps from the attention maps. For comparison, we also display the segmentation mask obtained by prompting SAM with points. We observe that the cross-attention mask is generally accurate and can serve as a reliable proxy when an external segmentor is not available.
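As an illustration, the binary mask could be derived from the CA-S attention maps roughly as below. This is a minimal sketch that assumes head-averaged attention probabilities on a square latent grid; the averaging protocol and the min-max normalization are our assumptions, not necessarily the exact procedure used.

```python
import torch

def token_mask_from_cross_attention(attn_maps, token_index, threshold=0.4):
    """Derive a binary mask for one prompt token from CA-S attention maps.

    attn_maps:   cross-attention probabilities of shape (heads, H*W, num_tokens).
    token_index: index of the word of interest (e.g. 'man') among the prompt tokens.
    threshold:   fixed cutoff applied after normalizing the map to [0, 1].
    """
    spatial = attn_maps.mean(dim=0)[:, token_index]                # (H*W,)
    spatial = (spatial - spatial.min()) / (spatial.max() - spatial.min() + 1e-8)
    side = int(spatial.numel() ** 0.5)                             # assume a square grid
    return spatial.reshape(side, side) > threshold                 # boolean (H, W) mask
```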
We then examine the impact of mask-guided coordination through both qualitative and quantitative results across four settings: {w/o UniEdit, UniEdit w/o mask, UniEdit with mask from CA-S, UniEdit with mask from SAM}. Qualitatively, UniEdit significantly enhances the consistency between the edited videos and the original video, and the mask-guided coordination technique further improves the consistency of unedited areas (e.g., color and texture). The quantitative results above are consistent with this analysis.
[1] Kirillov, Alexander, et al. "Segment anything."
[2] Hertz, Amir, et al. "Prompt-to-Prompt Image Editing with Cross-Attention Control."
Fig. E.1. Per-branch outputs for 'playing guitar' --> 'waving' and 'walking' --> 'lying'.
The output of each branch is visualized in Section E, Fig. E.1. The motion-reference branch (right) generates a video with the target motion and effectively transfers it to the main editing path (middle); meanwhile, the main path inherits the content from the reconstruction branch (left), which enhances the consistency of the unedited parts.
Fig. E.2. Two examples: source video, reconstruction branch output, and reconstruction branch output w/ null-text inversion.
| Inversion | FID (↓) | LPIPS (↓) | PSNR (↑) |
| --- | --- | --- | --- |
| DDIM | 21.30 | 0.140 | 34.26 |
| DDIM + Null-text inversion [4] | 17.81 | 0.158 | 33.75 |
As shown, the reconstruction branch faithfully reconstructs the source video; it therefore retains the content of the source video and can be leveraged for content preservation during editing.
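For completeness, the deterministic DDIM inversion that the reconstruction branch builds on can be sketched as follows. Here `eps_model` is a placeholder name for the base model's noise predictor, and the null-text inversion variant [4], which additionally optimizes the unconditional embedding at each step, is omitted; this is a generic sketch rather than the paper's exact implementation.

```python
import torch

@torch.no_grad()
def ddim_invert(latents, text_emb, eps_model, alphas_cumprod, timesteps):
    """Invert clean video latents back to noise along the DDIM trajectory.

    latents:        clean latents of the source video (z_0).
    eps_model:      hypothetical noise predictor, called as eps_model(z, t, text_emb).
    alphas_cumprod: cumulative alpha schedule, indexed by timestep.
    timesteps:      DDIM schedule ordered from low noise to high noise.
    """
    z = latents
    trajectory = [z]
    for i, t in enumerate(timesteps[:-1]):
        t_next = timesteps[i + 1]
        a_t, a_next = alphas_cumprod[t], alphas_cumprod[t_next]
        eps = eps_model(z, t, text_emb)
        # Predict the clean latent, then re-noise it to the next (higher) noise level.
        z0_pred = (z - (1 - a_t).sqrt() * eps) / a_t.sqrt()
        z = a_next.sqrt() * z0_pred + (1 - a_next).sqrt() * eps
        trajectory.append(z)
    return z, trajectory
```

The final noise (and, optionally, the intermediate latents) can then serve as the starting point for the reconstruction branch during editing.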
Visualization of foreground mask extracted by \(U^2\)-Net:
Fig. E.3.
cat --> dog, + Van Gogh style
raccoon --> panda, play violin --> waving
raccoon --> kangaroo, play violin --> play guitar
(the model does not learn the correct posture for playing the violin; with or without our method, it cannot generate the 'violin on the shoulder' posture)
a shark on the left and several goldfish swim in a tank
We exhibit failure cases in Section F. Rows 1-3 show cases where multiple elements are edited simultaneously, and we observe a relatively large inconsistency with the source video. A naive solution is to perform editing with UniEdit multiple times, once per element. Row 4 visualizes the results of editing a video with a complex scene; the model sometimes fails to understand the semantics of the target prompt, resulting in incorrect editing. This may be caused by the base model's limited text-understanding capability, as discussed in [1]. It could be alleviated by leveraging the reasoning power of MLLMs [1] or by adapting approaches for editing in complex scenarios [2].
[1] Huang, Yuzhou, et al. "SmartEdit: Exploring Complex Instruction-based Image Editing with Multimodal Large Language Models."
[2] Mao, Qi, et al. "MAG-Edit: Localized Image Editing in Complex Scenarios via Mask-Based Attention-Adjusted Guidance."