MV-Adapter: Multi-view Consistent Image Generation Made Easy

Anonymous authors

TL;DR: MV-Adapter is a versatile plug-and-play adapter that turns T2I models and their derivatives into multi-view generators under various conditions, enabling applications such as 3D generation, 3D texture generation, and more.

Row 1 shows results from integrating MV-Adapter with personalized T2Is, distilled few-step T2Is, and ControlNets, demonstrating its adaptability. Row 2 shows results under various control signals, including view-guided and geometry-guided generation with text or image inputs, showcasing its versatility.

Here we show MV-Adapter generating views at elevations ranging from 0 to 30 degrees.

Abstract

Generating multi-view images of an object has important applications in content creation and perception. Existing methods achieve this by making invasive changes to pre-trained text-to-image (T2I) models and performing full-parameter training, leading to three main limitations: (1) high computational costs, especially for high-resolution outputs; (2) incompatibility with derivatives and extensions of the base model, such as personalized models, distilled few-step models, and plugins like ControlNets; (3) limited versatility, as they primarily serve a single purpose and cannot handle diverse conditioning signals such as text, images, and geometry. In this paper, we present MV-Adapter to address all of the above limitations. MV-Adapter is designed as a plug-and-play module that works on top of pre-trained T2I models. This enables efficient training for high-resolution synthesis while maintaining full compatibility with all kinds of derivatives of the base T2I model. MV-Adapter provides a unified implementation for generating multi-view images from various conditions, facilitating applications such as text- and image-based 3D generation and texturing. We demonstrate that MV-Adapter sets a new quality standard for multi-view image generation, and opens up new possibilities due to its adaptability and versatility.

Method


MV-Adapter is a plug-and-play adapter that learns multi-view priors transferable to derivatives of T2I models without specific tuning, and enables T2I models to generate multi-view consistent images under various conditions. At inference time, MV-Adapter, which contains a condition guider (yellow) and decoupled attention layers (blue), can be inserted directly into a personalized or distilled T2I model to constitute the multi-view generator.
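To make the plug-and-play pattern concrete, below is a minimal PyTorch sketch of how an adapter branch can sit on top of a frozen base block. The class AdaptedBlock, its dimensions, and the residual wiring are illustrative assumptions rather than the actual MV-Adapter implementation; the key point it shows is that the base weights are never modified, which is why personalized or distilled derivatives of the base T2I remain compatible.

import torch
import torch.nn as nn

class AdaptedBlock(nn.Module):
    """Wraps a frozen block of a base T2I network and adds a trainable
    multi-view attention branch whose output is summed residually."""

    def __init__(self, base_block: nn.Module, dim: int, num_heads: int = 8):
        super().__init__()
        self.base_block = base_block
        for p in self.base_block.parameters():
            # Base weights stay frozen, so personalized or distilled
            # variants of the base model remain fully compatible.
            p.requires_grad_(False)
        self.mv_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor, num_views: int) -> torch.Tensor:
        h = self.base_block(x)  # (batch * views, tokens, channels)
        bv, l, c = h.shape
        # Reshape so tokens from all views of one object attend jointly.
        joined = h.reshape(bv // num_views, num_views * l, c)
        mv, _ = self.mv_attn(joined, joined, joined)
        # Residual add; a common adapter trick (assumed here) is to
        # zero-initialize the new branch's output projection so the
        # adapted model starts out identical to the base model.
        return h + mv.reshape(bv, l, c)

# Dummy usage: one object, six views, 16 tokens of width 64 per view.
block = AdaptedBlock(nn.Linear(64, 64), dim=64)
out = block(torch.randn(6, 16, 64), num_views=6)
print(out.shape)  # torch.Size([6, 16, 64])

Because the base block is untouched, swapping in a different checkpoint changes only base_block while the trained adapter weights carry over.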

Our MV-Adapter consists of two components: (1) a condition guider that encodes camera or geometry conditions; (2) decoupled attention layers, which contain multi-view attention layers for learning multi-view consistency and optional image cross-attention layers to support image-conditioned generation, where the pre-trained U-Net is used to encode the reference image and extract fine-grained information.
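As a rough illustration of these two components, the self-contained sketch below pairs a small convolutional condition guider with an image cross-attention branch. All layer sizes are assumptions, and the camera condition is represented here as a per-pixel ray map purely for illustration (a geometry condition such as a rendered position map could be encoded the same way). In MV-Adapter the reference features come from the pre-trained U-Net; here a random tensor stands in for them.

import torch
import torch.nn as nn

class ConditionGuider(nn.Module):
    """Encodes a per-view condition map (e.g., a 6-channel camera ray
    map) into a feature map that can be added to the U-Net's features."""
    def __init__(self, cond_channels: int = 6, dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(cond_channels, dim, 3, padding=1),
            nn.SiLU(),
            nn.Conv2d(dim, dim, 3, padding=1),
        )

    def forward(self, cond: torch.Tensor) -> torch.Tensor:
        return self.net(cond)  # (batch * views, dim, H, W)

class ImageCrossAttention(nn.Module):
    """Optional branch for image-conditioned generation: latent tokens
    of each view attend to fine-grained reference-image features."""
    def __init__(self, dim: int = 64, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor, ref: torch.Tensor) -> torch.Tensor:
        out, _ = self.attn(x, ref, ref)  # query: view tokens, key/value: reference
        return x + out

# Dummy usage: six views at 32x32 latent resolution.
guider, img_attn = ConditionGuider(), ImageCrossAttention()
rays = torch.randn(6, 6, 32, 32)            # per-view camera ray maps (assumed encoding)
feats = guider(rays)                        # (6, 64, 32, 32) conditioning features
tokens = feats.flatten(2).transpose(1, 2)   # (6, 1024, 64) token sequence
ref = torch.randn(6, 1024, 64)              # stand-in for U-Net reference features
print(img_attn(tokens, ref).shape)          # torch.Size([6, 1024, 64])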

Text-to-Multiview


Image-to-Multiview


Sketch-to-Multiview (with ControlNet)


Text-conditioned 3D Generation


Image-conditioned 3D Generation


Text-conditioned Texture Generation

Image-conditioned Texture Generation