Prompt engineering is fundamentally imprecise. You describe a face in words and a diffusion model makes its best guess, interpolating through a vast learned space that was never designed for the exact attributes you have in mind. A new paper from researchers at Linköping University proposes conditioning diffusion models on visual representations directly, bypassing text entirely.
Published May 26, 2026 on arXiv, "Towards Controllable Image Generation through Representation-Conditioned Diffusion Models" demonstrates that using DINO self-supervised visual features as conditioning vectors produces more precise, semantically controllable generation than text prompts, with the added ability to discover and manipulate visual directions without any labeled training data.
What Happened
Nithesh Chandher Karthikeyan, Jonas Unger, and Gabriel Eilertsen at Linköping University trained a Latent Diffusion Model conditioned on 768-dimensional DINO representations rather than text embeddings. DINO is a self-supervised vision transformer from Meta AI Research that learns rich semantic features from images without any labels.

The approach, called RCDM (Representation-Conditioned Diffusion Model), replaces the standard text encoder conditioning with a DINO encoder. You feed an image into the encoder, get a semantic feature vector, and use that vector to guide the diffusion process. The model learns to generate new images that are semantically similar to the input along dimensions it discovers through training.
Two key behaviors emerged. First, the model learns smooth interpolation: blending two representation vectors produces images that transition naturally between the semantic content of both source images. Second, PCA on the learned representation space reveals interpretable semantic directions, such as forehead size, hair length, and background color, that can be manipulated independently without labels.
Why It Matters for Creators
Text prompts hit a ceiling. You can describe "a woman with long dark hair against a white background" but you cannot precisely specify the exact nose shape, the exact lighting temperature, or the exact amount of background blur. Those attributes require either extensive prompt engineering, ControlNet conditioning (which requires depth or pose maps), or image-to-image with careful denoising strength tuning.
Representation-conditioned diffusion offers a different path: instead of describing what you want in words, you show the model an image and ask for variations along specific visual axes. The model learns what those axes are from the data structure itself.
Compared to existing conditioning approaches:
| Approach | How it controls generation | What you need to provide |
|---|---|---|
| Text prompts (CLIP) | Natural language description | Words |
| ControlNet | Structural maps (pose, depth, edge) | Preprocessed control image |
| IP-Adapter | Reference image appearance | Reference image |
| RCDM (this paper) | Self-supervised semantic representations | Source image + representation vector |
The advantage of RCDM is that the semantic directions are discovered rather than predefined. You do not need to know in advance which attributes matter for your dataset. PCA reveals the directions that explain the most variance, and you can then traverse those directions to generate controlled variations.
Key Details
The system was trained and evaluated on CelebA (faces) and LSUN Churches (architectural scenes). Results showed:

- Perturbation robustness: RCDM maintained quality under Gaussian noise perturbations at noise levels where Diffusion Inversion degraded significantly (above lambda=0.4)
- Semantic smoothness: Representation interpolation produced more semantically coherent transitions than text-conditioned baselines
- Unsupervised direction discovery: PCA on 1,000 face representations surfaced identifiable attributes like hair length and background color without any labels
The paper acknowledges one limitation: disentanglement remains below GAN-based approaches. When you move along a "hair length" direction, other attributes may shift slightly. But unlike GANs, the method generalizes to arbitrary domains without domain-specific architecture changes.
Creator Outcome: What This Enables
RCDM is a research paper without a public code release. But its implications for practical image generation tools are direct:
- Label-free style control: Any image collection can be analyzed for its dominant variation directions without building a labeled dataset. A product photographer could discover that their catalog varies primarily along lighting angle and background saturation, then generate controlled variants along those axes.
- Precise reference-based generation: Instead of describing a face style in words (which a text encoder may not capture accurately), feed a reference image directly as a representation vector
- Smooth variation generation: Interpolate between two source images in representation space to generate a continuum of variations, useful for A/B testing creative assets or generating product color variants
- Consistency across generations: Conditioning on a fixed representation vector produces images with consistent semantic properties across multiple generations, useful for character consistency in illustration workflows
These capabilities extend what creators can do with tools like HuggingFace Diffusers. The DINO encoder is freely available and can be integrated as a conditioning module in any Latent Diffusion Model pipeline.
How to Experiment Now
While waiting for the official code release, you can approximate RCDM-style representation conditioning using existing tools:

- Install DINO and HuggingFace Diffusers
- Extract DINO features from a source image using the ViT-B/8 backbone (produces 768-dim vectors)
- Use IP-Adapter as a proxy: pass the DINO features as image embeddings to condition a SDXL pipeline
- Experiment with interpolating between two DINO vectors at different weights and observe how generated images transition
- Apply PCA to a set of DINO vectors from related images to discover the dominant variation directions in your collection
Creators building on the current generation of AI image tools will recognize this as an evolution of IP-Adapter-style reference conditioning, but with a self-supervised backbone that generalizes more robustly across domains than CLIP-based approaches.
Frequently Asked Questions
Does RCDM work with FLUX or only with Stable Diffusion?
The paper uses a Latent Diffusion Model architecture (similar to SD1.x/SDXL). The principle of replacing text conditioning with DINO representations applies to any LDM, but adapting it to FLUX would require a new training run since the architecture and conditioning mechanism differ. No FLUX-specific results are published.
How is this different from IP-Adapter?
IP-Adapter uses CLIP image embeddings to condition generation, primarily capturing appearance. RCDM uses DINO representations, which capture more geometric and structural semantic content. DINO was designed for tasks like segmentation and depth estimation, making it stronger at spatial semantics than CLIP, which was optimized for image-text alignment.
Can I use this for product visualization or consistent character generation?
The consistency properties demonstrated in the paper (perturbation robustness, smooth interpolation) directly support these use cases. For product visualization, you could generate variants along lighting or background directions. For character consistency, a fixed representation vector anchors the generation to consistent semantic properties across different poses or scenes.
Why are GANs still better at disentanglement?
GANs like StyleGAN were designed specifically for disentangled latent spaces, with architectural choices (style injection, noise inputs) that encourage factor separation. Diffusion models use a completely different generation process and were not originally optimized for disentanglement. RCDM makes progress here but the gap reflects a fundamental architectural difference.
Will this replace prompt engineering?
Not as a replacement, but as a complement. Text prompts are excellent for semantic content (what objects appear, what scene is depicted). Representation conditioning is better for precise visual properties that are hard to describe in words (exact spatial proportions, precise lighting characteristics). Future systems will likely combine both.