Vortic team · 20 min read

Why we run on free OpenRouter models (and what to watch out for)

Llama 3.3 70B, DeepSeek V3, Qwen 2.5, Gemini 2.0 — the contrarian case for free, and the engineering you need around them.

Executive summary

Vortic defaults to a curated free-tier OpenRouter model roster so we can meter underwriting work per action without pricing every customer into enterprise-seat economics. Free inference is not free operationally: rate limits, flaky JSON, silent model churn, and uneven hallucination rates become your COGS in engineering hours. This article explains the model mix, the five reliability hazards we engineered around, three operational use cases (scenario / features / outcomes / benefits), and how token economics feed buyer-facing credit pricing.

The model roster and why each slot exists

  • Fast band — Llama 3.3 70B Instruct: triage, routing, lightweight transforms.
  • Balanced band — DeepSeek V3: most specialist agents (cost/latency sweet spot).
  • Heavy synthesis — Qwen 2.5 72B Instruct: memo merge where coherence matters most.
  • Multimodal — Gemini 2.0 Flash: messy schedules, scans, mixed layouts.

Assigning models per agent isolates failures: if extraction quality slips, you tune one subgraph—not the entire stack.

Economic leverage for underwriting SaaS

Approximate token economics taught us an uncomfortable truth: premium endpoints make sense for demos, but per-run margins collapse if every specialist fires an expensive model sequentially. Free tiers shift marginal inference toward zero so long as retry/fallback discipline holds—letting us expose transparent credit metering (full pipeline ≈ 18 credits at $0.10/credit during standard programmes) instead of hiding model subsidies inside opaque seat fees.
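The credit arithmetic above is simple enough to sketch. The constants below mirror the illustrative figures in this post, not a published rate card:

```typescript
// Illustrative credit maths only — figures match the example in the text,
// not an actual Vortic price list.
const CREDIT_PRICE_USD = 0.1; // assumed standard-programme rate
const FULL_PIPELINE_CREDITS = 18; // assumed full-pipeline metering

function runCostUsd(
  credits: number,
  pricePerCredit: number = CREDIT_PRICE_USD,
): number {
  return credits * pricePerCredit;
}

// A full pipeline run meters at roughly $1.80 under these assumptions.
console.log(runCostUsd(FULL_PIPELINE_CREDITS));
```

Because the meter is per action rather than per seat, a buyer can forecast cost directly from submission throughput.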

Buyers care about predictable unit costs at burst, not our provider religion.

The five operational hazards (and mitigations)

1. Aggressive rate limits. OpenRouter free tiers impose tight request budgets; parallel fan-out can hit HTTP 429 under concurrent desks.

Mitigation: streamWithRetry() with exponential backoff on 429/503; optional paid fallback via OPENROUTER_PAID_FALLBACK_MODEL (for example anthropic/claude-haiku-4.5) so operators never hard-fail mid-bind prep.
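A minimal sketch of that retry/fallback shape, assuming the real streamWithRetry() streams tokens rather than returning a string; callModel here is a stand-in for the OpenRouter request:

```typescript
// Hedged sketch — not the production streamWithRetry(). `callModel` stands
// in for an OpenRouter call that throws an error carrying an HTTP status.
type ModelCall = (model: string) => Promise<string>;

const RETRYABLE_STATUSES = new Set([429, 503]);

async function streamWithRetry(
  callModel: ModelCall,
  model: string,
  maxAttempts = 4,
  baseDelayMs = 500,
  // Assumed env var name from the text; test code can pass undefined.
  fallbackModel: string | undefined = (globalThis as any).process?.env
    ?.OPENROUTER_PAID_FALLBACK_MODEL,
): Promise<string> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await callModel(model);
    } catch (err: any) {
      if (!RETRYABLE_STATUSES.has(err?.status)) throw err;
      if (attempt === maxAttempts - 1) {
        // Backoff bucket exhausted: try the paid fallback once, if configured.
        if (fallbackModel) return callModel(fallbackModel);
        throw err;
      }
      // Exponential backoff with jitter: ~500ms, ~1s, ~2s, ...
      const delay = baseDelayMs * 2 ** attempt * (0.5 + Math.random());
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  throw new Error("unreachable");
}
```

The key design choice is that the fallback fires only after the free-tier backoff bucket is exhausted, so paid spend stays an exception path rather than a default.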

2. Prose-before-JSON. Models often prepend conversational wrappers ("Sure! Here is the JSON…") that break a naive JSON.parse.

Mitigation: Strip wrappers; scan for the first balanced {...} block (a regex alone cannot match nested braces); validate schema; degrade to envelope types instead of crashing routes.
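One way to implement that extraction, as a sketch: a small string-aware scanner that finds the first balanced object and parses it, returning null on failure rather than throwing.

```typescript
// Minimal sketch of the prose-stripping step. String-aware so braces
// inside JSON string values don't confuse the depth counter.
function extractFirstJsonObject(raw: string): unknown | null {
  const start = raw.indexOf("{");
  if (start === -1) return null;
  let depth = 0;
  let inString = false;
  let escaped = false;
  for (let i = start; i < raw.length; i++) {
    const ch = raw[i];
    if (inString) {
      if (escaped) escaped = false;
      else if (ch === "\\") escaped = true;
      else if (ch === '"') inString = false;
      continue;
    }
    if (ch === '"') inString = true;
    else if (ch === "{") depth++;
    else if (ch === "}") {
      depth--;
      if (depth === 0) {
        try {
          return JSON.parse(raw.slice(start, i + 1));
        } catch {
          return null; // degrade instead of crashing the route
        }
      }
    }
  }
  return null; // unbalanced output — treat as a failed extraction
}
```

Anything that survives this scanner still goes through schema validation before it reaches downstream agents.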

3. Model churn. Free model IDs disappear without ceremony.

Mitigation: Central registry (lib/llm/models.ts)—swap IDs per agent in one commit; pin versions in traces for replay.
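A registry in the spirit of lib/llm/models.ts might look like the following; the agent names and :free model slugs below are illustrative assumptions, not the actual file contents:

```typescript
// Illustrative registry shape — agent names and OpenRouter-style slugs
// are assumptions, not the real lib/llm/models.ts.
type AgentName = "triage" | "extraction" | "memoMerge" | "documentVision";

const MODEL_REGISTRY: Record<AgentName, string> = {
  triage: "meta-llama/llama-3.3-70b-instruct:free",
  extraction: "deepseek/deepseek-chat:free",
  memoMerge: "qwen/qwen-2.5-72b-instruct:free",
  documentVision: "google/gemini-2.0-flash-exp:free",
};

function modelFor(agent: AgentName): string {
  return MODEL_REGISTRY[agent];
}
```

When a free ID is retired upstream, the fix is a one-line change in this map; traces pin the ID that actually ran, so old runs stay replayable against the model they used.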

4. Quality variance. Specialist extraction F-scores shift when you switch model families.

Mitigation: Golden-set regression per agent; promote a model only when memo completeness and citation density both hold.

5. No native tool-schema enforcement. Unlike hosted tool-use APIs, free routes rely on prompt contracts.

Mitigation: Strict JSON schema prompts + validator gates + partial refunds on failed specialist nodes where billing fairness matters.
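A validator gate for one hypothetical specialist output might look like this; the field names are invented for illustration, and production schemas would be richer:

```typescript
// Hedged sketch of a prompt-contract validator gate. With no native
// tool-schema enforcement on free routes, every specialist's output is
// checked field-by-field before it can flow downstream (or be billed).
interface SpecialistOutput {
  riskSummary: string;
  citations: string[];
}

type Gate<T> = { ok: true; value: T } | { ok: false; reason: string };

function validateSpecialistOutput(candidate: unknown): Gate<SpecialistOutput> {
  if (typeof candidate !== "object" || candidate === null) {
    return { ok: false, reason: "not an object" };
  }
  const c = candidate as Record<string, unknown>;
  if (typeof c.riskSummary !== "string" || c.riskSummary.length === 0) {
    return { ok: false, reason: "missing riskSummary" };
  }
  if (
    !Array.isArray(c.citations) ||
    !c.citations.every((x) => typeof x === "string")
  ) {
    return { ok: false, reason: "invalid citations" };
  }
  return {
    ok: true,
    value: { riskSummary: c.riskSummary, citations: c.citations as string[] },
  };
}
```

A failed gate is what triggers the partial credit refund for that specialist node, so billing fairness falls out of the same check that protects data quality.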

Use case 1 — Specialty MGA scaling burst submissions post-marketing campaign

Scenario: Thirty-day inbound spike doubles concurrent pipelines; finance insists marginal AI spend stays bounded.

Key features

  • Retry-aware streaming; optional paid fallback only on exhausted backoff buckets.

Outcomes

  • Stable P95 latency versus brittle demos collapsing under parallel fan-out.

Benefits

  • Growth marketing tolerated without emergency procurement negotiations mid-quarter.

Use case 2 — Carrier innovation sandbox benchmarking Opus vs free roster honestly

Scenario: Architecture committee demands apples-to-apples ROI proof before mandating premium models globally.

Key features

  • Trace exports comparing hallucination rates, JSON validity, and wall-clock per golden submission.

Outcomes

  • Evidence-backed decision to allocate premium spend only on synthesis steps genuinely gaining measurable lift.

Benefits

  • Avoids a blanket Opus tax that unintentionally starves smaller delegated-authority partners.

Use case 3 — Finance controlling AI run-rate with departmental chargebacks

Scenario: Underwriting, claims analytics, and marketing pilots share one AI gateway—finance fears opaque overrun.

Key features

  • Per-agent credit ledger translating engineering topology into invoice lines that commercial stakeholders can read.

Outcomes

  • Forecast variance narrows; spikes are attributable to campaigns or renewal cron jobs, not mysterious usage blobs.

Benefits

  • A sustainable governance culture that funds continued automation waves responsibly.

Pricing narrative tying engineering to buyer maths

Illustrative comparison only—your tokens vary:

  • Premium-only stacks might approach $0.30–$0.50 of raw inference per deep pipeline run when chained conservatively.
  • Optimised free-first stacks approach $0 marginal inference when retries succeed, shifting gross margin toward servicing, support, and fabric integrations instead of hidden GPU subsidies.

Customers reward transparent metering aligned with submission throughput—not mystery subsidies collapsing at renewal.

Closing

Pick models like you pick peril datasets: deliberately, per workflow, with receipts. Free routers unlock economics; engineering discipline unlocks trust.

Tags: engineering, llm, openrouter