Developer Oscar Molnar installed a secondhand Tesla V100 SXM2 in his gaming PC alongside an RTX 4080, building a 32GB dual-GPU setup for about £200 in added hardware. The system runs a 27-billion-parameter local model at 32 tokens per second with 128k context, entirely offline.

What Happened

Molnar's RTX 4080 tops out at 16GB of VRAM, which limits the models and context sizes available for local inference. Rather than paying $2,000 or more for an RTX 5090 with 32GB, he sourced a used Tesla V100 SXM2 16GB for £150 and a £50 SXM2-to-PCIe adapter, then installed both GPUs in the same machine.

The dual-GPU configuration gives him 32GB of total VRAM for roughly £200 out of pocket. He documented the full process, including adapter wiring, driver selection, and fan noise mitigation, on his personal blog.

Why It Matters

Consumer GPU memory ceilings are expensive to break through. The RTX 4090 caps at 24GB. The RTX 5090 offers 32GB but costs $2,000 or more. The secondhand datacenter GPU route sidesteps that entirely. Running llama.cpp across both GPUs, Molnar reaches 32 tokens per second with full 128k token context and vision input enabled on a 27B model. That is real-time interactive speed for creative and coding workflows.
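A two-GPU launch of this kind can be sketched as a small Python helper that assembles a `llama-server` command line. The flags (`--n-gpu-layers`, `--ctx-size`, `--split-mode`, `--tensor-split`, `--mmproj`) are standard llama.cpp options, but the file names, the even 1,1 split, and the helper itself are assumptions for illustration and not Molnar's actual command.

```python
import shlex


def build_llama_server_cmd(model_path, mmproj_path=None, ctx_size=131072,
                           tensor_split=(1, 1)):
    """Assemble a llama-server argument list for a two-GPU layer split."""
    cmd = [
        "llama-server",
        "--model", model_path,
        "--n-gpu-layers", "99",       # larger than the layer count: offload everything
        "--ctx-size", str(ctx_size),  # 131072 tokens, i.e. 128k context
        "--split-mode", "layer",      # place whole layers on each GPU
        "--tensor-split", ",".join(str(part) for part in tensor_split),
    ]
    if mmproj_path is not None:
        # Vision input needs the multimodal projector file next to the model.
        cmd += ["--mmproj", mmproj_path]
    return cmd


# Hypothetical file names; substitute the GGUF files actually downloaded.
cmd = build_llama_server_cmd("model-Q5_K_M.gguf", mmproj_path="mmproj.gguf")
print(shlex.join(cmd))
```

With two 16GB cards an even split is a natural starting point. The ratio can be skewed if one card also drives the desktop and has less free memory.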

This approach is gaining traction among creators and developers who need more VRAM than consumer cards offer, without the cost of professional-grade hardware.

Key Technical Details

  • GPU: Tesla V100 SXM2 16GB (Volta, 5120 CUDA cores, 900 GB/s HBM2 bandwidth) purchased for £150
  • Adapter: SXM2-to-PCIe bare PCB, £50. Fan connector: JST PH2.0 4-pin, rewired to a motherboard fan header for PWM speed control
  • OS and drivers: NixOS with Linux 6.6 and NVIDIA driver branch 535. Driver 560 and newer drop Volta support, so the driver version must be pinned
  • Model: Qwen3.6-27B-MTP at Q5_K_M quantization (19GB), sourced from bartowski's GGUF collection on Hugging Face
  • Performance: 32 tok/s token generation, 133 to 160 tok/s prompt processing, 128k token context window, vision input active
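The figures in the list above support some back-of-envelope arithmetic, sketched here in Python. The inputs come from the article. Treating "128k" as 131072 tokens, using decimal gigabytes, and the 500-token reply length are assumptions made for the example.

```python
# Figures quoted in the article.
TOTAL_VRAM_GB = 32            # 16 GB RTX 4080 + 16 GB V100
PER_CARD_GB = 16
MODEL_GB = 19                 # Q5_K_M weights
CONTEXT_TOKENS = 128 * 1024   # "128k" taken as 131072 tokens (assumption)
GEN_TOK_PER_S = 32
PROMPT_TOK_PER_S = (133, 160)

# The weights alone exceed one card, which is why the model is split.
fits_on_one_card = MODEL_GB <= PER_CARD_GB

# What is left for the KV cache, vision projector, and compute buffers.
headroom_gb = TOTAL_VRAM_GB - MODEL_GB

# Upper bound on per-token memory if all headroom went to the KV cache.
headroom_kb_per_token = headroom_gb * 1e9 / CONTEXT_TOKENS / 1e3

# Time to generate a 500-token reply (hypothetical length).
reply_seconds = 500 / GEN_TOK_PER_S

# Time to ingest a completely full context at the slow and fast prompt rates.
full_prompt_minutes = tuple(CONTEXT_TOKENS / rate / 60 for rate in PROMPT_TOK_PER_S)

print(f"fits on one 16 GB card: {fits_on_one_card}")
print(f"headroom after weights: {headroom_gb} GB")
print(f"per-token budget at full context: {headroom_kb_per_token:.1f} KB")
print(f"500-token reply: {reply_seconds:.1f} s")
print(f"full 128k prompt: {full_prompt_minutes[1]:.1f} to {full_prompt_minutes[0]:.1f} min")
```

The ingest estimate suggests that filling the whole window from cold takes roughly a quarter of an hour, so the 128k context is likely most comfortable when prompts grow incrementally rather than arriving all at once.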

The main challenge was noise: the V100's fan reaches 82 decibels at full speed. Molnar solved it by rewiring the fan connector to a standard motherboard fan header, which puts the fan under PWM speed control. A secondary issue remains: the GPU occasionally fails to enumerate on warm reboots and requires a full cold restart.
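A quick way to catch the warm-reboot problem is to check after boot that both cards are visible. The sketch below parses `nvidia-smi --query-gpu=name,memory.total --format=csv,noheader`. The helper names and the sample output are illustrative assumptions, not taken from Molnar's machine.

```python
import subprocess

QUERY = ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"]


def parse_gpu_list(csv_text):
    """Turn nvidia-smi CSV output into a list of (name, memory) tuples."""
    gpus = []
    for line in csv_text.strip().splitlines():
        name, _, memory = line.partition(",")
        gpus.append((name.strip(), memory.strip()))
    return gpus


def missing_gpus(gpus, expected_substrings=("V100", "4080")):
    """Return the expected name fragments that no detected GPU matches."""
    names = [name for name, _ in gpus]
    return [want for want in expected_substrings
            if not any(want in name for name in names)]


def check_now():
    """Query the live system; requires nvidia-smi on the PATH."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return missing_gpus(parse_gpu_list(result.stdout))


# Illustrative output only; real name strings may differ.
sample = "NVIDIA GeForce RTX 4080, 16376 MiB\nTesla V100-SXM2-16GB, 16384 MiB\n"
print(missing_gpus(parse_gpu_list(sample)))                   # []
print(missing_gpus(parse_gpu_list(sample.splitlines()[0])))   # ['V100']
```

If `check_now()` returns a non-empty list after a warm reboot, the article's workaround applies: power the machine off fully and start it cold.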

Creator Outcome: What This Enables

With 32GB of local VRAM and a vision-capable model, creators can run image captioning, multimodal content review, long-document summarization, and code editing over large codebases without cloud API costs or rate limits. The Qwen model series is particularly capable for mixed language, vision, and coding tasks. All inference runs offline, with no data leaving the machine.

The setup also works as a stepping stone: adding a third GPU or swapping to a larger or newer server card (V100 32GB or A100 40GB) follows the same SXM-to-PCIe adapter approach, though the A100 uses the SXM4 socket and needs a matching adapter.

What to Do Next

Molnar's full guide covers the SXM2 adapter wiring, NixOS driver configuration, fan PWM solution, and llama.cpp multi-GPU launch flags. Tesla V100 SXM2 16GB units appear regularly on used hardware markets in the £100 to £200 range. Confirm driver version compatibility before purchasing: Volta GPUs need a driver branch that still supports the architecture, and Molnar's setup pins branch 535.
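The driver check can be sketched as a comparison on the branch number, using the version string reported by `nvidia-smi --query-gpu=driver_version --format=csv,noheader`. The 560 cutoff below is the figure quoted in this article and should be verified against NVIDIA's release notes. The function names are hypothetical.

```python
def driver_branch(version):
    """Return the major branch number of a version string like '535.154.05'."""
    return int(version.strip().split(".")[0])


def volta_ok(version, first_unsupported_branch=560):
    """True if the driver branch is below the cutoff quoted in the article.

    The default cutoff is an assumption taken from the article, not from
    NVIDIA documentation.
    """
    return driver_branch(version) < first_unsupported_branch


print(volta_ok("535.154.05"))   # True
print(volta_ok("560.35.03"))    # False
```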