What is ComfyUI IPAdapter plus
A look at ComfyUI’s reference implementation for IPAdapter models.
This IPAdapter implementation follows the ComfyUI way of doing things. It is designed to be memory efficient and fast, and to keep working across future Comfy updates.
IPAdapters are powerful models for image-to-image conditioning. From a single reference image you can generate variations, further steered by text prompts, ControlNets, and masks. Think of it as a 1-image LoRA.
How to Install ComfyUI IPAdapter plus
Download or git clone this repository inside
The pre-trained models are available on huggingface, download and place them in the
For SD1.5 you need:
- ip-adapter_sd15_light.bin, use this when text prompt is more important than reference images
For SDXL you need:
- ip-adapter_sdxl_vit-h.bin This model requires the use of the SD1.5 encoder despite being for SDXL checkpoints
- ip-adapter-plus_sdxl_vit-h.bin Same as above, use the SD1.5 encoder
Additionally you need the image encoders to be placed in the
- SD 1.5 model (use this also for the SDXL ip-adapter_sdxl_vit-h.bin and ip-adapter-plus_sdxl_vit-h.bin models)
- SDXL model
You can rename them to something easier to remember or put them into a sub-directory.
How to Use ComfyUI IPAdapter plus
There’s a basic workflow included in this repo and a few examples in the examples directory. Usually it’s a good idea to lower the
weight to at least
The noise parameter is an experimental exploitation of the IPAdapter models. You can set it as low as 0.01 for an arguably better result. Please report your experience with the noise option!
More info about the noise option
Basically, the IPAdapter sends two pictures for the conditioning: one is the reference, and the other (which you don’t see) is an empty image that can be thought of as a negative conditioning.
What I’m doing is sending a very noisy image instead of an empty one. The noise parameter determines the amount of noise that is added. A value of 0.01 adds a lot of noise (more noise == less impact because the model doesn’t get it); a value of 1.0 removes most of the noise, so the generated image gets conditioned more.
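To make the direction of the parameter concrete, here is a minimal sketch in plain Python. It is not the node's actual code: the function name `noisy_negative`, the linear mapping from `noise` to noise amplitude, and the flat list-of-floats "image" are all assumptions chosen for illustration. In this simplified version a value of 1.0 removes the noise entirely.

```python
import random


def noisy_negative(num_pixels: int, noise: float, seed: int = 0) -> list[float]:
    """Build a hypothetical 'negative' image as a flat list of pixel values.

    Assumption (not taken from the node's source): the noise amplitude
    shrinks linearly as `noise` grows, so 0.01 yields an almost fully
    noisy image and 1.0 yields an empty (all-zero) one.
    """
    if not 0.0 <= noise <= 1.0:
        raise ValueError("noise must be between 0.0 and 1.0")
    rng = random.Random(seed)
    amplitude = 1.0 - noise
    # Start from an empty image and add uniform noise scaled by amplitude.
    return [amplitude * rng.random() for _ in range(num_pixels)]


def mean_abs(values: list[float]) -> float:
    """Average magnitude of the pixel values, used as a rough noise measure."""
    return sum(abs(v) for v in values) / len(values)


if __name__ == "__main__":
    for setting in (0.01, 0.5, 1.0):
        level = mean_abs(noisy_negative(1000, setting))
        print(f"noise={setting}: mean noise level {level:.3f}")
```

Lower settings produce a noisier negative image, which matches the description above: more noise means the negative image carries less usable information.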
IMPORTANT: Preparing the reference image
The reference image needs to be encoded by the CLIP vision model. The encoder resizes the image to 224×224 and crops it to the center! This is not an IPAdapter thing; it is how CLIP vision works. It means that if you use a portrait or landscape image and the main subject (e.g. the face of a character) is not in the middle, you’ll likely get undesired results. Use square pictures as reference for more predictable results.
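The effect of the center crop can be checked with a few lines of Python. This is a sketch under the assumption that the encoder scales the shorter side to 224 and then center-crops, which keeps only the largest centered square of the original. The helper names `center_crop_region` and `survives_crop` are hypothetical.

```python
def center_crop_region(width: int, height: int) -> tuple[int, int, int, int]:
    """Return (left, top, right, bottom), in original pixel coordinates,
    of the centered square that remains after the center crop."""
    side = min(width, height)
    left = (width - side) // 2
    top = (height - side) // 2
    return left, top, left + side, top + side


def survives_crop(point: tuple[int, int], width: int, height: int) -> bool:
    """Tell whether a point of interest (e.g. a face) stays inside the crop."""
    x, y = point
    left, top, right, bottom = center_crop_region(width, height)
    return left <= x < right and top <= y < bottom


if __name__ == "__main__":
    # Portrait image, 512 wide by 768 tall, with a face near the top edge.
    print(center_crop_region(512, 768))
    print(survives_crop((256, 60), 512, 768))
    print(survives_crop((256, 384), 512, 768))
```

In the portrait example, a face near the top edge falls outside the surviving square, so the encoder never sees it.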
I’ve added a PrepImageForClipVision node that does all the required operations for you. You just have to select the crop position (top/left/center/etc…) and a sharpening amount if you want.
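Choosing a crop position amounts to deciding where the square window sits along the longer side of the image. The sketch below shows the idea only; it is not the node's code. The function `square_crop_box` is hypothetical, and the option names other than top, left and center are assumed.

```python
def square_crop_box(width: int, height: int, position: str = "center") -> tuple[int, int, int, int]:
    """Return (left, top, right, bottom) of a square crop anchored at `position`.

    Hypothetical helper: "bottom" and "right" are assumed option names.
    """
    side = min(width, height)
    extra_x = width - side   # spare pixels along the horizontal axis
    extra_y = height - side  # spare pixels along the vertical axis
    offsets = {
        "center": (extra_x // 2, extra_y // 2),
        "top": (extra_x // 2, 0),
        "bottom": (extra_x // 2, extra_y),
        "left": (0, extra_y // 2),
        "right": (extra_x, extra_y // 2),
    }
    if position not in offsets:
        raise ValueError(f"unknown crop position: {position!r}")
    left, top = offsets[position]
    return left, top, left + side, top + side


if __name__ == "__main__":
    # Portrait with the face at the top: anchor the square to the top edge.
    print(square_crop_box(512, 768, "top"))
    # Landscape with the subject on the right-hand side.
    print(square_crop_box(768, 512, "right"))
```

Once the square is cut at the chosen position, the later resize to 224×224 no longer discards the subject.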
The add_weight option is useful only for image batches; do not use it otherwise. It effectively doubles the weight of the image within a batch, as if you had sent the same image twice.
In the image below you can see the difference between prepped and not prepped images.
KSampler configuration suggestions
The IPAdapter generally requires a few more steps than usual; if the result is underwhelming, try adding 10+ steps.
euler seem to perform better than others.
The model tends to burn the images a little. If needed lower the CFG scale.
The SDXL models are weird but the noise option sometimes helps.
IPAdapter + ControlNet
The model is very effective when paired with a ControlNet. In the example below I experimented with Canny. The workflow is in the examples directory.
IPAdapter offers an interesting model for a kind of “face swap” effect. The workflow is provided. Set a close-up of a face as the reference image and then enter your text prompt as usual. The generated character should have the face of the reference. It also works with img2img given a high denoise.
The most effective way to apply the IPAdapter to a region is with an inpainting workflow. Remember to use a checkpoint made specifically for inpainting, otherwise it won’t work. Even if you are inpainting a face, I find that IPAdapter-Plus (not the face one) works best.
It is possible to pass multiple images for the conditioning with the Batch Images node. An example workflow is provided; in the picture below you can see the result of conditioning on one and on two images.
It seems to be effective with 2-3 images; beyond that it tends to blur the information too much.
When sending multiple images you can increase/decrease the weight of each image by using the IPAdapterEncoder node. The workflow (included in the examples) looks like this:
The node accepts 4 images, but remember that you can send batches of images to each slot.
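Per-image weighting can be pictured as scaling each slot's embeddings before they are merged into one batch. The sketch below is a conceptual illustration only: the function `combine_weighted`, the plain-list embeddings, and the scale-then-concatenate rule are assumptions, not a description of how the IPAdapterEncoder node works internally.

```python
MAX_SLOTS = 4  # the node accepts 4 image inputs, per the text above


def combine_weighted(slots: list[tuple[list[list[float]], float]]) -> list[list[float]]:
    """Scale every embedding in each slot by that slot's weight, then
    concatenate all slots into a single batch.

    Each slot is (embeddings, weight), where `embeddings` is a batch
    (a list) of vectors, so one slot can carry several images.
    Assumed behavior, for illustration only.
    """
    if len(slots) > MAX_SLOTS:
        raise ValueError(f"at most {MAX_SLOTS} slots are supported")
    combined = []
    for embeddings, weight in slots:
        for vector in embeddings:
            combined.append([weight * x for x in vector])
    return combined


if __name__ == "__main__":
    main_image = ([[1.0, 2.0]], 1.0)                 # one image, full weight
    style_batch = ([[2.0, 4.0], [0.0, 1.0]], 0.5)    # two images, half weight
    print(combine_weighted([main_image, style_batch]))
```

The second slot here carries a batch of two images that share one weight, which mirrors the point that each slot can receive a batch.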
Troubleshooting of ComfyUI IPAdapter plus
Error: ‘CLIPVisionModelOutput’ object has no attribute ‘penultimate_hidden_states’
You are using an old version of ComfyUI. Update and you’ll be fine.
Error with Tensor size mismatch
You are using the wrong CLIP encoder+IPAdapter Model+Checkpoint combo. Remember that you need to select the CLIP encoder v1.5 for all v1.5 IPAdapter models AND for all models ending with vit-h (even if they are for SDXL).
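The pairing rule can be written down as a small lookup. This is a sketch: the function `required_encoder` is hypothetical, and detecting the model family from the `sd15` and `sdxl` substrings in the filename is an assumption that only holds if the files keep their original names.

```python
def required_encoder(ipadapter_filename: str) -> str:
    """Return which CLIP vision encoder an IPAdapter model file needs.

    Rule from the text: v1.5 adapters and every adapter whose name ends
    with "vit-h" use the SD1.5 encoder; other SDXL adapters use the
    SDXL encoder. Filename matching is an assumed heuristic.
    """
    # Drop the extension so "…vit-h.bin" is seen as ending in "vit-h".
    stem = ipadapter_filename.lower().rsplit(".", 1)[0]
    if stem.endswith("vit-h") or "sd15" in stem:
        return "SD1.5 encoder"
    if "sdxl" in stem:
        return "SDXL encoder"
    raise ValueError(f"cannot infer encoder for {ipadapter_filename!r}")


if __name__ == "__main__":
    for name in (
        "ip-adapter_sd15_light.bin",
        "ip-adapter_sdxl_vit-h.bin",
        "ip-adapter-plus_sdxl_vit-h.bin",
    ):
        print(name, "->", required_encoder(name))
```

If you renamed the model files, a substring check like this no longer works and the pairing has to be tracked by hand.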
Is it true that the input reference image must have the same size as the output image?
No, that’s an urban legend. Your input and output images can be of any size. Remember that all input images are scaled and cropped to 224×224 anyway.