Control Diffusion Models Without Text Using DINO Reps

Prompt engineering is fundamentally imprecise. You describe a face in words and a diffusion model makes its best guess, interpolating through a vast learned space that was never designed for the exact attributes you have in mind. A new paper from researchers at Linköping University proposes conditioning diffusion models on visual representations directly, bypassing text entirely.

Published May 26, 2026 on arXiv, "Towards Controllable Image Generation through Representation-Conditioned Diffusion Models" demonstrates that using DINO self-supervised visual features as conditioning vectors produces more precise, semantically controllable generation than text prompts, with the added ability to discover and manipulate visual directions without any labeled training data.

What Happened

Nithesh Chandher Karthikeyan, Jonas Unger, and Gabriel Eilertsen at Linköping University trained a Latent Diffusion Model conditioned on 768-dimensional DINO representations rather than text embeddings. DINO is a self-supervised vision transformer from Meta AI Research that learns rich semantic features from images without any labels.

RCDM representation conditioned diffusion model concept

The approach, called RCDM (Representation-Conditioned Diffusion Model), replaces the standard text encoder conditioning with a DINO encoder. You feed an image into the encoder, get a semantic feature vector, and use that vector to guide the diffusion process. The model learns to generate new images that are semantically similar to the input along dimensions it discovers through training.

Two key behaviors emerged. First, the model learns smooth interpolation: blending two representation vectors produces images that transition naturally between the semantic content of both source images. Second, PCA on the learned representation space reveals interpretable semantic directions, such as forehead size, hair length, and background color, that can be manipulated independently without labels.

Why It Matters for Creators

Text prompts hit a ceiling. You can describe "a woman with long dark hair against a white background" but you cannot precisely specify the exact nose shape, the exact lighting temperature, or the exact amount of background blur. Those attributes require either extensive prompt engineering, ControlNet conditioning (which requires depth or pose maps), or image-to-image with careful denoising strength tuning.

Representation-conditioned diffusion offers a different path: instead of describing what you want in words, you show the model an image and ask for variations along specific visual axes. The model learns what those axes are from the data structure itself.

Compared to existing conditioning approaches:

Approach	How it controls generation	What you need to provide
Text prompts (CLIP)	Natural language description	Words
ControlNet	Structural maps (pose, depth, edge)	Preprocessed control image
IP-Adapter	Reference image appearance	Reference image
RCDM (this paper)	Self-supervised semantic representations	Source image + representation vector

The advantage of RCDM is that the semantic directions are discovered rather than predefined. You do not need to know in advance which attributes matter for your dataset. PCA reveals the directions that explain the most variance, and you can then traverse those directions to generate controlled variations.

Key Details

The system was trained and evaluated on CelebA (faces) and LSUN Churches (architectural scenes). Results showed:

Visual similarity robustness comparison for RCDM diffusion model

Perturbation robustness: RCDM maintained quality under Gaussian noise perturbations at noise levels where Diffusion Inversion degraded significantly (above lambda=0.4)
Semantic smoothness: Representation interpolation produced more semantically coherent transitions than text-conditioned baselines
Unsupervised direction discovery: PCA on 1,000 face representations surfaced identifiable attributes like hair length and background color without any labels

The paper acknowledges one limitation: disentanglement remains below GAN-based approaches. When you move along a "hair length" direction, other attributes may shift slightly. But unlike GANs, the method generalizes to arbitrary domains without domain-specific architecture changes.

Creator Outcome: What This Enables

RCDM is a research paper without a public code release. But its implications for practical image generation tools are direct:

Label-free style control: Any image collection can be analyzed for its dominant variation directions without building a labeled dataset. A product photographer could discover that their catalog varies primarily along lighting angle and background saturation, then generate controlled variants along those axes.
Precise reference-based generation: Instead of describing a face style in words (which a text encoder may not capture accurately), feed a reference image directly as a representation vector
Smooth variation generation: Interpolate between two source images in representation space to generate a continuum of variations, useful for A/B testing creative assets or generating product color variants
Consistency across generations: Conditioning on a fixed representation vector produces images with consistent semantic properties across multiple generations, useful for character consistency in illustration workflows

These capabilities extend what creators can do with tools like HuggingFace Diffusers. The DINO encoder is freely available and can be integrated as a conditioning module in any Latent Diffusion Model pipeline.

How to Experiment Now

While waiting for the official code release, you can approximate RCDM-style representation conditioning using existing tools:

DINO diffusion model experiment setup on local machine

Install DINO and HuggingFace Diffusers
Extract DINO features from a source image using the ViT-B/8 backbone (produces 768-dim vectors)
Use IP-Adapter as a proxy: pass the DINO features as image embeddings to condition a SDXL pipeline
Experiment with interpolating between two DINO vectors at different weights and observe how generated images transition
Apply PCA to a set of DINO vectors from related images to discover the dominant variation directions in your collection

Creators building on the current generation of AI image tools will recognize this as an evolution of IP-Adapter-style reference conditioning, but with a self-supervised backbone that generalizes more robustly across domains than CLIP-based approaches.

Frequently Asked Questions

Does RCDM work with FLUX or only with Stable Diffusion?

The paper uses a Latent Diffusion Model architecture (similar to SD1.x/SDXL). The principle of replacing text conditioning with DINO representations applies to any LDM, but adapting it to FLUX would require a new training run since the architecture and conditioning mechanism differ. No FLUX-specific results are published.

How is this different from IP-Adapter?

IP-Adapter uses CLIP image embeddings to condition generation, primarily capturing appearance. RCDM uses DINO representations, which capture more geometric and structural semantic content. DINO was designed for tasks like segmentation and depth estimation, making it stronger at spatial semantics than CLIP, which was optimized for image-text alignment.

Can I use this for product visualization or consistent character generation?

The consistency properties demonstrated in the paper (perturbation robustness, smooth interpolation) directly support these use cases. For product visualization, you could generate variants along lighting or background directions. For character consistency, a fixed representation vector anchors the generation to consistent semantic properties across different poses or scenes.

Why are GANs still better at disentanglement?

GANs like StyleGAN were designed specifically for disentangled latent spaces, with architectural choices (style injection, noise inputs) that encourage factor separation. Diffusion models use a completely different generation process and were not originally optimized for disentanglement. RCDM makes progress here but the gap reflects a fundamental architectural difference.

Will this replace prompt engineering?

Not as a replacement, but as a complement. Text prompts are excellent for semantic content (what objects appear, what scene is depicted). Representation conditioning is better for precise visual properties that are hard to describe in words (exact spatial proportions, precise lighting characteristics). Future systems will likely combine both.

Control Diffusion Models Without Text Using DINO Reps

What Happened

Why It Matters for Creators

Key Details

Creator Outcome: What This Enables

How to Experiment Now

Frequently Asked Questions

Does RCDM work with FLUX or only with Stable Diffusion?

How is this different from IP-Adapter?

Can I use this for product visualization or consistent character generation?

Why are GANs still better at disentanglement?

Will this replace prompt engineering?

Keep reading

How to Keep AI Characters Consistent Across Images

ComfyUI v0.25 Bundles Kling, Depth, and 3D Nodes

Claude Code Artifacts: Turn Sessions Into Live Pages

What Happened

Why It Matters for Creators

Key Details

Creator Outcome: What This Enables

How to Experiment Now

Frequently Asked Questions

Does RCDM work with FLUX or only with Stable Diffusion?

How is this different from IP-Adapter?

Can I use this for product visualization or consistent character generation?

Why are GANs still better at disentanglement?

Will this replace prompt engineering?

Stay ahead of AI

Keep reading

How to Keep AI Characters Consistent Across Images

ComfyUI v0.25 Bundles Kling, Depth, and 3D Nodes

Claude Code Artifacts: Turn Sessions Into Live Pages

Stay ahead of Creative AI