Google released gemini-3.1-flash-lite as a generally available (GA) stable model on May 7, 2026. At $0.25 per million input tokens and $1.50 per million output tokens, it is the fastest and most cost-efficient model in the Gemini 3 family, accepting text, image, video, audio, and PDF inputs across a 1 million token context window. Developers still on gemini-3.1-flash-lite-preview have until May 11 to migrate -- the preview endpoint is deprecated that day and shuts down completely on May 25.
What Is Gemini 3.1 Flash-Lite?
Flash-Lite is Google's entry point into the Gemini 3 model series, built for high-throughput, cost-sensitive workloads. It launched in preview on March 3, 2026, as the first Flash-Lite model in the Gemini 3 generation. The May 7 GA release graduates it from preview to a stable production endpoint with full SLA coverage.
The model scores 86.9% on GPQA Diamond and 76.8% on MMMU Pro -- competitive results for creative and analytical tasks where response speed and cost matter more than maximum reasoning depth. Output throughput sits at 207.5 tokens per second, and context caching lets you store frequently used prompts or reference documents to reduce cost on repeated calls.
Preview to GA: What Actually Changed

The capability set is identical between the preview and GA releases. What changed is the stability guarantee: per the official changelog, the GA model carries production SLA coverage and version-stable behavior. Preview models can change or disappear without notice; stable models follow Google's standard versioning commitments, including advance deprecation warnings before any endpoint changes.
No breaking changes were introduced. The migration is a single model ID string update.
Migrate Before May 11: One-Line Fix
The Gemini API deprecations page lists two firm deadlines for the preview model:
- May 11, 2026 -- gemini-3.1-flash-lite-preview is deprecated. New API calls may begin returning deprecation warnings.
- May 25, 2026 -- The preview endpoint shuts down completely. All requests will fail.
The fix is one line in your code:

```
// Before
model = "gemini-3.1-flash-lite-preview"

// After
model = "gemini-3.1-flash-lite"
```

No other API parameters, request formats, or response formats change. If you use the Google AI Python or JavaScript SDKs, update the model string and redeploy.
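For codebases with several call sites, a small guard can catch any stragglers before the shutdown. This is a hypothetical helper, not part of the SDK -- the mapping table and function name are illustrative:

```python
# Hypothetical helper: map the deprecated preview ID to its stable
# replacement so a missed call site gets fixed (with a warning) instead of
# silently failing after the May 25 shutdown.
DEPRECATED_MODEL_IDS = {
    "gemini-3.1-flash-lite-preview": "gemini-3.1-flash-lite",
}

def resolve_model_id(model: str) -> str:
    """Return the stable model ID, warning if a deprecated one is passed."""
    if model in DEPRECATED_MODEL_IDS:
        stable = DEPRECATED_MODEL_IDS[model]
        print(f"warning: {model} is deprecated; using {stable}")
        return stable
    return model
```

Route every model string through resolve_model_id until you are confident no call site still references the preview ID.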
Capabilities at a Glance
| Capability | Gemini 3.1 Flash-Lite |
|---|---|
| Input modalities | Text, Image, Video, Audio, PDF |
| Output modalities | Text only |
| Context window | 1,048,576 tokens (1M) |
| Max output tokens | 65,536 |
| Function calling | Yes |
| Structured outputs | Yes |
| Configurable thinking | Yes (minimal / low / medium / high) |
| Batch API | Yes (50% cost reduction) |
| Context caching | Yes |
| Search grounding | Yes |
| Image generation | No |
| Audio generation | No |
| Live API | No |
Pricing Breakdown

Full pricing is published on the Gemini API pricing page:
| Tier | Input (text/image/video) | Output |
|---|---|---|
| Free tier | Free | Free |
| Standard | $0.25 / 1M tokens | $1.50 / 1M tokens |
| Batch (async, up to 24h) | $0.125 / 1M tokens | $0.75 / 1M tokens |
Context caching adds $0.025 per 1M cached tokens plus $1.00 per 1M tokens per hour of storage. Batch processing cuts all rates by 50% at the cost of asynchronous delivery -- the right choice for overnight asset pipelines, bulk metadata tagging, or weekly audit runs that do not need real-time responses.
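To see what the batch discount means in practice, here is a quick cost estimate using the standard-tier rates from the table above (a back-of-envelope sketch; it ignores caching and free-tier allowances):

```python
# Cost estimate from the published per-1M-token rates.
INPUT_RATE = 0.25     # USD per 1M input tokens (standard tier)
OUTPUT_RATE = 1.50    # USD per 1M output tokens (standard tier)
BATCH_DISCOUNT = 0.5  # Batch API halves both rates

def job_cost(input_tokens: int, output_tokens: int, batch: bool = False) -> float:
    cost = (input_tokens / 1e6) * INPUT_RATE + (output_tokens / 1e6) * OUTPUT_RATE
    return cost * BATCH_DISCOUNT if batch else cost

# Example: 10M input tokens, 2M output tokens
standard = job_cost(10_000_000, 2_000_000)            # 2.50 + 3.00 = 5.50
batched = job_cost(10_000_000, 2_000_000, batch=True)  # 2.75
```

At high volumes the gap compounds: the same 12M-token job costs $5.50 synchronously and $2.75 in batch mode.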
Search grounding via Google Search is free for the first 5,000 prompts per month (shared across all Gemini 3 models), then $14 per 1,000 additional queries.
7 Creator Use Cases

Google's official developer guide documents the primary use patterns validated during the preview period. Each maps directly to creative production workflows.
1. Translation at Scale
Pass large volumes of user-generated content -- captions, comments, product descriptions -- with system instructions constraining output to translated text only. Flash-Lite's cost structure makes high-volume multilingual pipelines viable at a fraction of Flash or Pro pricing.
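A minimal sketch of that pattern, assuming the request shape of the Google AI SDKs (the payload keys and prompt wording here are illustrative, not an exact API contract):

```python
# Build one translation request covering a batch of short texts. The system
# instruction pins the output to translated text only, one line per item,
# so no extra parsing is needed downstream.
def build_translation_request(texts: list[str], target_lang: str) -> dict:
    return {
        "model": "gemini-3.1-flash-lite",
        "system_instruction": (
            f"Translate each numbered item into {target_lang}. "
            "Return only the translated text, one item per line, "
            "in the same order. No commentary."
        ),
        "contents": "\n".join(f"{i + 1}. {t}" for i, t in enumerate(texts)),
    }

req = build_translation_request(["New drop Friday", "Link in bio"], "Spanish")
```

Packing many short items into one numbered request, rather than one call per caption, is what keeps per-item cost low at scale.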
2. Audio Transcription
Upload audio files directly to the API and prompt for formatted transcripts with speaker labels, timestamps, or structured outputs ready for downstream hand-offs. Relevant for podcast creators, voice-over workflows, and accessibility pipelines where you need accurate text fast.
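Once the model returns a formatted transcript, the downstream hand-off is plain parsing. The "[MM:SS] Speaker: text" line format below is an assumption set by your own prompt, not an API guarantee:

```python
import re

# Parse transcript lines of the form "[MM:SS] Speaker: text" into rows
# ready for a subtitle file, CMS import, or accessibility pipeline.
LINE_RE = re.compile(r"\[(\d{2}):(\d{2})\]\s+([^:]+):\s+(.*)")

def parse_transcript(raw: str) -> list[dict]:
    rows = []
    for line in raw.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            minutes, seconds, speaker, text = m.groups()
            rows.append({
                "start_sec": int(minutes) * 60 + int(seconds),
                "speaker": speaker.strip(),
                "text": text,
            })
    return rows

sample = "[00:03] Host: Welcome back.\n[00:07] Guest: Thanks for having me."
rows = parse_transcript(sample)
```

If you prompt for structured outputs instead, the API can return this shape as JSON directly and the regex step disappears.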
3. Document Processing
PDF parsing, summarization, and cross-document comparison within the 1M token context window. Creative studios can apply this to competitive research, brand guideline extraction, spec-sheet analysis, or any workflow requiring structured data from large documents.
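Before stuffing multiple PDFs into a single request, it is worth a rough budget check against the 1M window. The 4-characters-per-token ratio below is a coarse heuristic, not the official tokenizer:

```python
# Rough token-budget check for multi-document requests.
CONTEXT_WINDOW = 1_048_576   # Flash-Lite context window, in tokens
MAX_OUTPUT = 65_536          # reserve headroom for the response
CHARS_PER_TOKEN = 4          # coarse heuristic for English text

def fits_in_context(doc_chars: list[int], reserve: int = MAX_OUTPUT) -> bool:
    """Estimate whether documents of the given character counts fit in one call."""
    est_tokens = sum(c // CHARS_PER_TOKEN for c in doc_chars)
    return est_tokens + reserve <= CONTEXT_WINDOW

# Three ~200-page PDFs at roughly 2,000 characters per page each
ok = fits_in_context([400_000, 400_000, 400_000])
```

For precise numbers, the API's token-counting endpoint is the authoritative check; this sketch just decides whether you need to chunk at all.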
4. Structured Data Extraction
Use Pydantic schemas with structured output mode to extract entities, classify content, or score sentiment from large text corpora. Useful for asset tagging, social listening pipelines, and content moderation at scale.
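A minimal sketch of the extraction schema, assuming Pydantic v2 is installed; the field names here are illustrative, and in production the JSON string would come back from the API's structured output mode rather than being hard-coded:

```python
from pydantic import BaseModel

# Illustrative asset-tagging schema. With structured output mode the model
# is constrained to return JSON matching this shape.
class AssetTags(BaseModel):
    title: str
    topics: list[str]
    sentiment: str  # e.g. "positive" | "neutral" | "negative"

# In practice this string is the model's response; here we validate a sample.
raw = '{"title": "Spring campaign teaser", "topics": ["fashion", "launch"], "sentiment": "positive"}'
tags = AssetTags.model_validate_json(raw)
```

Validation at the schema boundary means malformed responses fail loudly at ingest time instead of corrupting the tagging pipeline downstream.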
5. Intelligent Model Routing
Flash-Lite works well as a fast intent classifier that routes requests to more capable models only when needed. Google reports approximately 40% total cost reduction with no quality loss on complex tasks when using this routing pattern. If you already have an async Gemini pipeline, routing is a natural addition to reduce spend on high-volume jobs.
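The routing pattern can be sketched in a few lines. The keyword classifier below is a stub standing in for what would, in production, be a Flash-Lite call with a constrained prompt ("Reply with exactly SIMPLE or COMPLEX"); the model IDs are from this article:

```python
# Stub classifier: in production, replace with a Flash-Lite API call that
# returns a single constrained label.
def classify_intent(prompt: str) -> str:
    complex_markers = ("step by step", "compare", "plan", "constraint")
    return "COMPLEX" if any(m in prompt.lower() for m in complex_markers) else "SIMPLE"

def route_model(prompt: str) -> str:
    """Escalate to a more capable model only when the classifier says so."""
    if classify_intent(prompt) == "COMPLEX":
        return "gemini-3.1-flash"
    return "gemini-3.1-flash-lite"
```

Since the classification call itself runs on the cheapest tier, the routing overhead stays small relative to the savings on high-volume simple requests.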
6. Configurable Thinking
Thinking levels (minimal, low, medium, high) let you tune reasoning depth per request. Set minimal for real-time chat responses, medium for code generation, and high for multi-constraint prompts like layout planning or script structure. This avoids paying for deep reasoning on tasks that do not need it.
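A per-task mapping makes that tuning explicit in code. The "thinking_level" key below is an assumption about the config parameter name -- check your SDK version's docs for the exact field:

```python
# Thinking level per task type, following the guidance above.
THINKING_BY_TASK = {
    "chat": "minimal",          # real-time responses
    "codegen": "medium",        # code generation
    "layout_planning": "high",  # multi-constraint prompts
}

def request_config(task: str) -> dict:
    """Build a request config with the appropriate thinking level."""
    return {
        "model": "gemini-3.1-flash-lite",
        "thinking_level": THINKING_BY_TASK.get(task, "low"),
    }
```

Defaulting unknown tasks to "low" keeps unclassified traffic cheap rather than silently paying for deep reasoning.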
7. High-Throughput Batch Processing
The Batch API delivers 50% cost savings for non-time-sensitive workloads: bulk image descriptions, overnight content moderation, weekly SEO audits, or retroactive metadata tagging for large asset libraries. Jobs complete within 24 hours. See the Gemini for Creative Work guide for a full walkthrough of integrating the Batch API into a production pipeline.
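Batch jobs are typically submitted as a JSONL file, one request per line. The request envelope below is an assumption for illustration -- match it to the exact format in the Batch API docs for your SDK version:

```python
import json

# Write a Batch API input file: one JSON request object per line (JSONL).
# The "key" field lets you match results back to source assets.
def build_batch_file(prompts: list[str], path: str) -> int:
    with open(path, "w", encoding="utf-8") as f:
        for i, prompt in enumerate(prompts):
            row = {
                "key": f"item-{i}",
                "request": {
                    "model": "gemini-3.1-flash-lite",
                    "contents": prompt,
                },
            }
            f.write(json.dumps(row) + "\n")
    return len(prompts)
```

For an overnight run, generate the file from your asset database, upload it as a batch job, and reconcile results by key the next morning.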
What to Do Next
- Update your model ID now -- change gemini-3.1-flash-lite-preview to gemini-3.1-flash-lite before May 11.
- Review the full changelog at ai.google.dev/gemini-api/docs/changelog -- the May 6 Interactions API schema change and May 5 multimodal File Search update may affect your integration.
- Enable Batch API -- if any part of your pipeline is non-real-time, the 50% cost savings add up quickly on large volumes.
- Start on the free tier -- Flash-Lite is free up to generous usage limits, making it safe to test before committing to paid-tier capacity.
Frequently Asked Questions
What is the difference between Gemini 3.1 Flash-Lite and Gemini 3.1 Flash?
Flash-Lite is optimized for speed and cost efficiency, trading some capability depth for faster throughput and lower per-token pricing. Flash offers higher benchmark scores and deeper reasoning at a higher price point. Use Flash-Lite as your default tier and escalate to Flash or Pro only when task complexity genuinely requires it.
Do I need to rewrite any code to migrate from preview to GA?
No rewrite needed. Change the model ID string from gemini-3.1-flash-lite-preview to gemini-3.1-flash-lite and redeploy. All other API parameters, request schemas, and response formats remain identical.
What happens if I do not migrate before May 11?
The preview model is deprecated on May 11, which may introduce deprecation warnings in API responses. The endpoint shuts down completely on May 25. Any application still using gemini-3.1-flash-lite-preview after that date will receive errors on every request. Migrate now to avoid production disruption.
Is Gemini 3.1 Flash-Lite free to use?
Yes. The free tier covers both input and output tokens at no charge. Paid pricing begins at $0.25 per million input tokens and $1.50 per million output tokens once you exceed free tier limits or require paid-tier features like Search Grounding at scale.
Can Flash-Lite generate images or audio?
No. Flash-Lite is a text-output model only. It accepts images, video, audio, and PDFs as inputs for analysis and understanding, but all outputs are text. For image generation, use a dedicated generation endpoint such as Imagen 3 in the Gemini API suite.
What is configurable thinking and when should I use it?
Thinking is an internal reasoning step the model performs before generating a response. Flash-Lite supports four levels: minimal (fastest), low, medium, and high (most thorough). Use minimal for simple lookups and real-time chat, medium for code and content generation, and high for complex multi-step problems like layout planning or constraint-heavy analysis.
How does the Batch API work for creative pipelines?
The Batch API accepts asynchronous jobs that process outside real-time latency requirements, completing within 24 hours. All standard pricing rates are cut by 50% in batch mode. For creative studios, this is ideal for overnight image description runs, weekly metadata audits across a full asset library, or bulk content classification that does not need immediate results.