💳 Billing & Cost Control

Sogni Intelligence bills both LLM inference (chat and planning tokens) and creative media work in one of two token types, SOGNI or Spark. It also exposes per-request controls to estimate, cap, and explicitly approve media costs before any paid work runs. This page collects the rules and primitives that are otherwise documented separately on the chat, durable-run, and workflow endpoints.

TL;DR: pass token_type to pick which balance pays; pair durable requests with max_estimated_capacity_units and confirm_cost: true to enforce a hard cap and require explicit approval; watch billing_preview_updated / run_waiting_for_user events for cost-approval pauses.

#Token types

| Token | What it pays for | How to get it |
|---|---|---|
| SOGNI | Native Sogni Supernet inference: all Sogni-trained / Sogni-hosted models (Qwen LLMs, Z-Image, Chroma, Qwen Image Edit, FLUX, Wan video, Ace-Step 1.5 XL audio, etc.) | Earned by running a worker node, staking, and seasonal leaderboard airdrops; also available on the open market |
| Premium Spark | External-vendor models (OpenAI GPT Image 2, ByteDance Seedance 2.0 / Fast / Mini, and Alibaba HappyHorse 1.1); can also pay for any native model | Purchased with a credit card at dashboard.sogni.ai |
| Free Spark | Native Sogni Supernet (open-source) models only; cannot pay for vendor/premium models. Via the API, Free Spark is further restricted to Z-Image Turbo (`z_image_turbo_bf16`). | Claim the Monthly Boost: 400 free Spark per UTC month, available when your free-Spark balance is under 800 |

Learn more: SOGNI token vs Spark Points.

#Selecting which token pays

Every paid endpoint accepts a token_type field:

| `token_type` | Behavior |
|---|---|
| `"sogni"` | Pay in SOGNI when supported. Falls back to Spark for vendor-only jobs (e.g. GPT Image 2). |
| `"spark"` | Pay in Spark for everything in the request. Required for vendor models. |
| `"auto"` | API picks: SOGNI for native models, Spark for vendor models. |

You can send token_type in the JSON body or as the X-Token-Type header; if both are present, the body value wins. Tool execution inside a chat completion or chat run inherits the request's token_type, except that vendor-model tool calls are always normalized to Spark, even when the parent request asked for sogni or auto.

```json
{
  "model": "qwen3.6-35b-a3b-gguf-iq4xs",
  "messages": [{"role": "user", "content": "Make a hero image"}],
  "token_type": "spark"
}
```
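The precedence and normalization rules above can be sketched as a small client-side helper that predicts which balance a given tool call will draw from. This is a minimal sketch, not an SDK function: `requestedTokenType` and `billedTokenFor` are hypothetical names, the vendor list is copied from the model IDs named on this page, and the mapping assumes native models pay in SOGNI under `"sogni"` or `"auto"` (the "when supported" caveat is not modeled).

```typescript
// Hypothetical helper, not part of any Sogni SDK.
type TokenType = "sogni" | "spark" | "auto";
type BilledToken = "sogni" | "spark";

// Vendor model IDs named on this page; this list may go stale.
const VENDOR_MODELS: ReadonlySet<string> = new Set([
  "gpt-image-2",
  "seedance2",
  "seedance2-mini",
  "seedance2-fast",
  "happyhorse-1.1-t2v",
  "happyhorse-1.1-i2v",
  "happyhorse-1.1-r2v",
]);

// When both are present, the JSON body value wins over the X-Token-Type header.
function requestedTokenType(body?: TokenType, header?: TokenType): TokenType | undefined {
  return body ?? header;
}

// Predicts the balance a single tool call draws from.
function billedTokenFor(requested: TokenType, model: string): BilledToken {
  // Vendor-model calls are normalized to Spark regardless of the request.
  if (VENDOR_MODELS.has(model)) return "spark";
  // "spark" pays everything in Spark; "sogni" and "auto" pay native models in SOGNI.
  return requested === "spark" ? "spark" : "sogni";
}
```

For example, `billedTokenFor("sogni", "gpt-image-2")` returns `"spark"`, reflecting that there is no way to pay a vendor model in SOGNI.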

#LLM token spend

LLM inference is billable, not just media generation. Each chat turn fires multiple small billable LLM calls: the assistant rounds plus auxiliary cognition (vision analysis, prompt refinement, transition planning). They are denominated in the request's token_type (Spark or SOGNI) and appear in Billing history as a single per-turn line item (category Chat, with input/output/total token counts), separate from media line items (category Media).

In durable chat runs the authoritative per-round cost arrives on the llm_spend SSE event (costInToken, costInUSD, tokenType, modelName, and token counts). Dedupe on payload.eventId when tallying.
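A minimal sketch of tallying `llm_spend` events with deduplication on `payload.eventId`. It assumes the cost fields (`costInToken`, `tokenType`) sit on `event.payload` next to `eventId` and that `costInToken` is numeric; the `RunEvent` shape and the `tallyLlmSpend` name are hypothetical. Events without an `eventId` cannot be deduplicated, so this sketch counts every occurrence of them.

```typescript
// Hypothetical event shape; field names follow the llm_spend description above.
interface RunEvent {
  sequence: number;
  type: string;
  payload: Record<string, unknown>;
}

// Sums llm_spend costInToken per token type, skipping repeated eventIds.
function tallyLlmSpend(events: RunEvent[]): Record<string, number> {
  const seen = new Set<string>();
  const totals: Record<string, number> = {};
  for (const event of events) {
    if (event.type !== "llm_spend") continue;
    const eventId = event.payload.eventId;
    if (typeof eventId === "string") {
      if (seen.has(eventId)) continue; // same spend event delivered again
      seen.add(eventId);
    }
    const cost = Number(event.payload.costInToken); // assumes a numeric value
    if (!Number.isFinite(cost)) continue;
    const rawType = event.payload.tokenType;
    const tokenType = typeof rawType === "string" ? rawType : "unknown";
    totals[tokenType] = (totals[tokenType] ?? 0) + cost;
  }
  return totals;
}
```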

LLM-token spend is not gated by confirm_cost — only worker / vendor / ffmpeg media work pauses for approval. The cost-control primitives below (max_estimated_capacity_units, confirm_cost, cost-approval pauses) govern media work; LLM tokens bill as they are consumed.

#Vendor-model gating

GPT Image 2 (gpt-image-2), Seedance 2.0 (seedance2, seedance2-mini, seedance2-fast), and HappyHorse 1.1 (happyhorse-1.1-t2v, happyhorse-1.1-i2v, happyhorse-1.1-r2v) are external-vendor models. Two rules apply that don't apply to native models:

1. They require Premium Spark. When a request asks for them with token_type: "sogni", those tool calls are normalized to Spark; a vendor model can never be paid in SOGNI.
2. The router never picks them on your behalf. You must name the vendor model explicitly: ask for it in the user message so the LLM emits the right tool-call arguments, write the tool call yourself (sogni_tool_execution: false), or name it in a workflow step's arguments.model / arguments.videoModel. This prevents surprise Spark spend when the LLM is left to choose.

In chat completions, the simplest path is to name the vendor model in the user message so the LLM picks it up:

```json
{
  "messages": [
    { "role": "user", "content": "Use GPT Image 2 to generate the product hero on wet asphalt with neon rim light." }
  ],
  "model": "qwen3.6-35b-a3b-gguf-iq4xs",
  "token_type": "spark",
  "sogni_tools": "creative-tools"
}
```

In durable workflows, name the vendor model directly inside the step's arguments:

```json
{
  "input": {
    "title": "GPT Image 2 hero",
    "steps": [
      {
        "id": "hero",
        "toolName": "generate_image",
        "arguments": {
          "prompt": "Product hero on wet asphalt with neon rim light",
          "model": "gpt-image-2",
          "gptImageQuality": "high",
          "outputFormat": "png"
        }
      }
    ]
  },
  "token_type": "spark"
}
```
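Because vendor models only run when named explicitly, a client can scan a workflow before submitting it and tell the user which steps will bill in Premium Spark. This sketch checks only `arguments.model` and `arguments.videoModel`, the two fields named above; the `WorkflowStep` type and the `vendorSteps` helper are hypothetical, and the vendor list is copied from this page and may drift as models are added.

```typescript
// Hypothetical step shape covering the fields used in the workflow example.
interface WorkflowStep {
  id: string;
  toolName: string;
  arguments: { model?: string; videoModel?: string; [key: string]: unknown };
}

// Vendor model IDs named on this page; this list may go stale.
const VENDOR_MODELS: ReadonlySet<string> = new Set([
  "gpt-image-2",
  "seedance2",
  "seedance2-mini",
  "seedance2-fast",
  "happyhorse-1.1-t2v",
  "happyhorse-1.1-i2v",
  "happyhorse-1.1-r2v",
]);

// Returns the IDs of steps that explicitly name a vendor model.
function vendorSteps(steps: WorkflowStep[]): string[] {
  return steps
    .filter((step) =>
      [step.arguments.model, step.arguments.videoModel].some(
        (model) => model !== undefined && VENDOR_MODELS.has(model),
      ),
    )
    .map((step) => step.id);
}
```

Under token_type "sogni" or "auto", only the returned steps are billed in Spark; the remaining native steps still pay in SOGNI.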

For the full vendor-model option matrix (quality flags, output formats, context-image limits, audio windows for Seedance), see Chat Completions → External Media Models.

#Estimated capacity units

Durable creative workflows return an estimated capacity-units value: a shared cross-model unit that the API uses to express "how much paid work is this going to do." Use it for hard caps and pre-flight approval — not as an exact billing total.

- compose_workflow (the planner) returns estimated_capacity_units alongside the plan, plus a fits_budget flag.
- POST /v1/creative-agent/workflows rejects a request with 422, before anything is persisted, if the shared estimate exceeds max_estimated_capacity_units.
- The cost preview is best-effort. Final billing is reconciled against actual worker output (steps, resolution, duration, vendor-reported usage).

```bash
curl https://api.sogni.ai/v1/creative-agent/workflows \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "input": { "steps": [/* ... */] },
    "token_type": "spark",
    "max_estimated_capacity_units": 25,
    "confirm_cost": true
  }'
```

Pair max_estimated_capacity_units on the planner call and on the workflow start to hard-cap cost at both ends — the planner cap guards the LLM's plan; the start cap guards the final submission.
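One way to apply the two-ended cap in client code is to stop after the planner if its estimate does not fit, and otherwise repeat the same cap on the workflow start. This is a sketch under assumptions: only `estimated_capacity_units`, `fits_budget`, `token_type`, `max_estimated_capacity_units`, and `confirm_cost` come from this page; `decideStart`, `PlannerResult`, and `StartBody` are hypothetical, and the HTTP call itself is left out.

```typescript
type TokenType = "sogni" | "spark" | "auto";

// Hypothetical subset of the planner response.
interface PlannerResult {
  estimated_capacity_units: number;
  fits_budget: boolean;
}

// Hypothetical shape of the workflow start body.
interface StartBody {
  input: unknown;
  token_type: TokenType;
  max_estimated_capacity_units: number;
  confirm_cost: boolean;
}

type StartDecision =
  | { kind: "submit"; body: StartBody }
  | { kind: "replan"; reason: string };

function decideStart(
  plan: PlannerResult,
  input: unknown,
  cap: number,
  tokenType: TokenType,
  confirmCost: boolean,
): StartDecision {
  // Stop before the start call if the planner already reports an over-budget plan.
  if (!plan.fits_budget || plan.estimated_capacity_units > cap) {
    return { kind: "replan", reason: `estimate ${plan.estimated_capacity_units} exceeds cap ${cap}` };
  }
  // Repeat the same cap on start so the server re-checks the final submission.
  return {
    kind: "submit",
    body: { input, token_type: tokenType, max_estimated_capacity_units: cap, confirm_cost: confirmCost },
  };
}
```

The server still runs its own check, so a plan that passes here can still be rejected with 422 if the start-time estimate comes out higher.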

#Cost-approval flow

confirm_cost: true on a durable request means "do not spend until I've seen the estimate and said yes." The run pauses at a user-decision boundary, surfaces the estimate, and waits for an explicit approval before executing paid work.

#In durable chat runs

/v1/chat/runs raises a billing_preview_updated event when the hosted tool returns a preview, then transitions to status: "waiting_for_user" with a waiting payload that tells you what's pending.

```text
id: 5
event: billing_preview_updated
data: {"sequence":5,"type":"billing_preview_updated","at":"...","payload":{"estimated_capacity_units":18,"tokenType":"spark","details":[/* ... */]}}

id: 6
event: run_waiting_for_user
data: {"sequence":6,"type":"run_waiting_for_user","at":"...","payload":{"reason":"cost_approval_required","message":"Approve estimated 18 capacity units in Spark?","details":{"toolCallId":"call_abc","pendingToolCallIds":["call_abc"]}}}
```
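A client reading this stream needs to split it into frames and notice the cost-approval pause. The parser below is a minimal sketch that handles only the `id`, `event`, and `data` fields shown in the sample (it is not a complete EventSource implementation); `parseSseFrames` and `pendingCostApproval` are hypothetical names, and the payload shape is assumed from the sample event.

```typescript
interface SseFrame {
  id?: string;
  event?: string;
  data: string;
}

// Splits an SSE text chunk into frames separated by blank lines.
function parseSseFrames(text: string): SseFrame[] {
  const frames: SseFrame[] = [];
  for (const block of text.split(/\r?\n\r?\n/)) {
    const frame: SseFrame = { data: "" };
    const dataLines: string[] = [];
    for (const line of block.split(/\r?\n/)) {
      if (line === "" || line.startsWith(":")) continue; // skip blanks and comments
      const colon = line.indexOf(":");
      const field = colon === -1 ? line : line.slice(0, colon);
      let value = colon === -1 ? "" : line.slice(colon + 1);
      if (value.startsWith(" ")) value = value.slice(1); // one leading space is not part of the value
      if (field === "id") frame.id = value;
      else if (field === "event") frame.event = value;
      else if (field === "data") dataLines.push(value);
    }
    if (dataLines.length === 0) continue; // frames without data are ignored
    frame.data = dataLines.join("\n");
    frames.push(frame);
  }
  return frames;
}

// Returns the pending tool call ID if the run paused for cost approval.
function pendingCostApproval(frames: SseFrame[]): string | undefined {
  for (const frame of frames) {
    if (frame.event !== "run_waiting_for_user") continue;
    const parsed = JSON.parse(frame.data) as {
      payload?: { reason?: string; details?: { toolCallId?: string } };
    };
    if (parsed.payload?.reason === "cost_approval_required") return parsed.payload.details?.toolCallId;
  }
  return undefined;
}
```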

To approve, call POST /v1/chat/runs/:id/confirm-cost with the pending tool call ID, decision: "confirm", and the acceptedCostPreview echoed from the preview event (required for confirm; a stale or mismatched preview is rejected with 409). This resumes the same run:

```json
{
  "tool_call_id": "call_abc",
  "decision": "confirm",
  "acceptedCostPreview": {
    "totalEstimatedCapacityUnits": 18,
    "tokenType": "spark",
    "validityUntil": "2026-05-15T12:05:00.000Z"
  }
}
```

The confirm body may also carry an overrides object, restricted to qualityTier, safeContentFilter, and prompt / prompts, for prompt or quality edits the user made on the approval screen (see Durable Chat Runs → Cost approval). To decline only the pending tool call, send decision: "cancel" to the same endpoint. To abandon the whole run, cancel it with POST /v1/chat/runs/:id/cancel.
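The confirm and cancel bodies can be built with a small helper that refuses to confirm a preview that has already lapsed, since the server rejects stale previews with 409 anyway. This sketch assumes the preview object is echoed verbatim in the `acceptedCostPreview` shape shown above, that `validityUntil` marks when it stops being accepted, and that a cancel body needs only `tool_call_id` and `decision`; `buildConfirmBody` and `buildCancelBody` are hypothetical, and `overrides` is omitted.

```typescript
interface AcceptedCostPreview {
  totalEstimatedCapacityUnits: number;
  tokenType: string;
  validityUntil: string; // ISO-8601 timestamp
}

interface ConfirmCostBody {
  tool_call_id: string;
  decision: "confirm" | "cancel";
  acceptedCostPreview?: AcceptedCostPreview;
}

// Builds a confirm body, echoing the preview unchanged; throws if it has expired.
function buildConfirmBody(toolCallId: string, preview: AcceptedCostPreview, now: Date = new Date()): ConfirmCostBody {
  if (Date.parse(preview.validityUntil) <= now.getTime()) {
    throw new Error("cost preview has expired; confirming it would be rejected");
  }
  return { tool_call_id: toolCallId, decision: "confirm", acceptedCostPreview: preview };
}

// Builds a body that declines only the pending tool call.
function buildCancelBody(toolCallId: string): ConfirmCostBody {
  return { tool_call_id: toolCallId, decision: "cancel" };
}
```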

The waiting reasons are typed as RunWaitingReason in @sogni-ai/sogni-intelligence-client: ask_clarifying_question, select_media_required, cost_approval_required, safety_review_required, workflow_user_input_required, insufficient_credit, and other. (The @sogni-ai/sogni-protocol enum file publishes only the core subset; the two newest reasons ship via the intelligence-client type.) A pause flagged insufficient_credit cannot be resolved by confirming: the held tool has already failed for lack of credits, so add credits and submit a new run, or cancel.
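A UI typically branches on the waiting reason. The sketch below redeclares the reasons listed above as a local union so it runs standalone; in an app you would use the `RunWaitingReason` type from `@sogni-ai/sogni-intelligence-client` instead. Only the two outcomes this page documents are specific; the `NextStep` values and the catch-all branch are hypothetical.

```typescript
// Local mirror of the documented reasons, for a standalone example only.
type RunWaitingReason =
  | "ask_clarifying_question"
  | "select_media_required"
  | "cost_approval_required"
  | "safety_review_required"
  | "workflow_user_input_required"
  | "insufficient_credit"
  | "other";

// Hypothetical UI actions.
type NextStep = "show_cost_approval" | "add_credits_and_resubmit" | "prompt_user";

function nextStepFor(reason: RunWaitingReason): NextStep {
  switch (reason) {
    case "cost_approval_required":
      return "show_cost_approval"; // confirm or cancel via /confirm-cost
    case "insufficient_credit":
      return "add_credits_and_resubmit"; // confirming cannot resume this run
    default:
      return "prompt_user"; // hypothetical catch-all for the remaining reasons
  }
}
```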

#In durable workflows

POST /v1/creative-agent/workflows with confirm_cost: true and no prior approval returns the workflow in a waiting_for_user state instead of dispatching the first step. The workflow event stream emits the same billing_preview_updated event. Approve by retrying the start with the estimate acknowledged (passing the same Idempotency-Key), or by following the runtime's resume hint.

To skip the approval step in trusted server-side flows, set confirm_cost: false. The hard cap from max_estimated_capacity_units still applies — an over-budget plan is rejected before any spend.

#Per-tool cost metadata

Every chat turn produces a RunRecord (schema v2 — see Chat Completions → Replay Records). Each tool call inside a round carries optional cost_class and risk_level fields from a shared per-tool cost-metadata table, so UI clients can render cost/risk chips without re-deriving from raw arguments.

```json
{
  "tool_calls": [
    {
      "id": "call_abc",
      "function": { "name": "generate_video" },
      "cost_class": "high",
      "risk_level": "medium"
    }
  ]
}
```

Sample classes: free (analysis/metadata), low (single image), medium (multi-image), high (video), vendor (external). The exact mapping is defined per tool and surfaced through the protocol package.
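Since `cost_class` and `risk_level` are optional, chip rendering should tolerate their absence. A minimal sketch, with a hypothetical `ToolCallMeta` type covering only the fields in the snippet above, that falls back to `"unknown"` and counts tool calls per class:

```typescript
// Hypothetical subset of a RunRecord tool call.
interface ToolCallMeta {
  id: string;
  function: { name: string };
  cost_class?: string;
  risk_level?: string;
}

// Builds a short label for a cost/risk chip.
function chipLabel(call: ToolCallMeta): string {
  const cost = call.cost_class ?? "unknown";
  const risk = call.risk_level ?? "unknown";
  return `${call.function.name}: cost ${cost}, risk ${risk}`;
}

// Counts tool calls per cost_class, grouping missing values under "unknown".
function countByCostClass(calls: ToolCallMeta[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const call of calls) {
    const key = call.cost_class ?? "unknown";
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return counts;
}
```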

#Hard limits worth knowing

- Video safety limit. Hosted tool execution blocks any single request that would generate more than 20 minutes of total video content across variations, long-video segments, or batch fan-out. Split larger jobs into multiple workflow runs.
- Vision input cap. /v1/chat/completions accepts up to 20 vision images per request, each up to 10 MB and 1024 px on the longest side.
- Context-image limits per model. GPT Image 2 edit accepts up to 16 context images; Flux.2 Dev up to 6; Qwen Image Edit 2511 up to 3. The SDK and Sogni Socket enforce these before charging.
- Per-account daily ceilings. Account-level rate and spend limits apply on top of any per-request controls and are visible in the dashboard.
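Some of these limits are easy to check client-side before submitting. The sketch below mirrors only the video and vision caps; the `VideoJob` and `VisionImage` shapes are hypothetical, 10 MB is treated conservatively as 10,000,000 bytes, and the server remains authoritative. Per-model context-image limits are left out because their model IDs are not listed here.

```typescript
const MAX_TOTAL_VIDEO_SECONDS = 20 * 60; // 20 minutes per request
const MAX_VISION_IMAGES = 20;
const MAX_VISION_BYTES = 10 * 1000 * 1000; // assumption: 10 MB read as 10,000,000 bytes
const MAX_VISION_LONGEST_SIDE = 1024; // pixels

// Hypothetical shapes: one entry per clip, with count covering variations or fan-out.
interface VideoJob { durationSeconds: number; count: number }
interface VisionImage { bytes: number; width: number; height: number }

// Total seconds of video a request would produce across all jobs.
function totalVideoSeconds(jobs: VideoJob[]): number {
  return jobs.reduce((sum, job) => sum + job.durationSeconds * job.count, 0);
}

// Requests above 20 minutes are blocked; exactly 20 minutes is allowed.
function videoWithinLimit(jobs: VideoJob[]): boolean {
  return totalVideoSeconds(jobs) <= MAX_TOTAL_VIDEO_SECONDS;
}

// Lists vision-input violations; an empty array means the images look acceptable.
function visionProblems(images: VisionImage[]): string[] {
  const problems: string[] = [];
  if (images.length > MAX_VISION_IMAGES) problems.push(`too many images: ${images.length}`);
  images.forEach((img, i) => {
    if (img.bytes > MAX_VISION_BYTES) problems.push(`image ${i} exceeds 10 MB`);
    if (Math.max(img.width, img.height) > MAX_VISION_LONGEST_SIDE) {
      problems.push(`image ${i} longest side exceeds 1024 px`);
    }
  });
  return problems;
}
```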

#Putting it together: a safe request pattern

For a chat run that may invoke paid media tools:

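```bash
curl https://api.sogni.ai/v1/chat/runs \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -H "Idempotency-Key: campaign-2026-05-17-001" \
  -d '{
    "model": "qwen3.6-35b-a3b-gguf-iq4xs",
    "messages": [
      { "role": "user", "content": "Make a 5-shot product teaser, 9:16, 15s." }
    ],
    "token_type": "spark",
    "max_estimated_capacity_units": 30,
    "confirm_cost": true
  }'
```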

This gets you:

- Spark billing (works for native and vendor models).
- A hard cap at 30 capacity units: the run is rejected before persistence if the planned estimate exceeds it.
- An explicit user approval step before any paid work runs.
- Idempotent retries: a network hiccup on submit won't double-charge.

For a workflow your app already planned (no LLM in the loop):

```json
{
  "input": { "steps": [/* ... */] },
  "token_type": "spark",
  "max_estimated_capacity_units": 25,
  "confirm_cost": false
}
```

confirm_cost: false is safe here because your app already chose every step, so there is no LLM-driven surprise, and the hard cap still rejects any plan whose estimate exceeds your budget before anything is spent.

#Where each control lives

| Concept | Where to read more |
|---|---|
| `token_type` request field, vendor-model gating | Chat Completions → External Media Models |
| Durable chat run cost-approval flow | Durable Chat Runs |
| Workflow `max_estimated_capacity_units` / `confirm_cost` | Creative-Agent Workflows |
| Planner `estimated_capacity_units` / `fits_budget` | Chat Completions → Workflow Planning |
| `cost_class` + `risk_level` on tool calls | Chat Completions → Replay Records |
| Token-type enum + core waiting-reason enum | `@sogni-ai/sogni-protocol` |
| Full `RunWaitingReason` type | `@sogni-ai/sogni-intelligence-client` |