Rate limits and concurrency limits are policy-as-code controls that cap how fast an agent can call models and how many requests it can keep in flight at once. In Claw EA, you express those limits in a WPC and bind them to execution using a CST, so the agent runtime cannot “talk itself out” of the limits with a better prompt.
OpenClaw is a good baseline runtime for this because it already separates tool policy from sandboxing, but prompt-only controls still fail under load, retries, or prompt injection. The execution layer must be permissioned so the system can reject requests (for example with 429 behavior) even when the model is pushing to continue.
Step-by-step runbook
- Decide what you are limiting and where you want the limit enforced. Common split: concurrency limits at the gateway edge (per job, per agent, per tenant) and rate limits at the model egress (per model route or per key). Write down targets like “max 4 in-flight model calls per job” and “max 60 requests per minute per agent,” plus what to do on overflow (fail fast vs queue).
- Encode the limits into a WPC and treat the policy hash as the change control boundary. Keep the policy small and explicit: which agent IDs it applies to, what counters exist, and what the overflow response must be. Store the WPC in the WPC registry and reference it by its hash in your deployment notes and approvals (a sketch of computing and pinning that hash follows this runbook).
- Issue a CST for the agent job that includes a scope hash for the allowed actions and optionally pins the policy hash to the WPC you just published. This ensures a stolen token cannot be reused broadly, and it prevents a job from silently switching to a looser policy. Use job-scoped CST binding so replays of the same token in a different job context are rejected.
- Route model traffic through clawproxy so every model call produces gateway receipts. If you are using OpenRouter via fal, keep that routing inside clawproxy so the receipt is still emitted for each request. Make the agent runtime treat 429 responses as a hard “back off and stop” condition, not as a suggestion to keep trying forever (see the 429-handling sketch after this runbook).
- Configure OpenClaw tool policy and sandbox settings to reduce the blast radius when throttling triggers retries. For example, keep tool execution sandboxed so a retry storm cannot also become a filesystem or process storm on the host. Use OpenClaw’s inspector and security audit routines during rollout to catch common configuration footguns (open inbound triggers, elevated tools, permissive profiles).
- Capture a proof bundle per job run that includes the gateway receipts, the CST scope hash (and pinned WPC hash if used), and job metadata (agent ID, session ID, timestamps). Publish the resulting artifact to Trust Pulse for review when you need a human-readable audit trail (see the proof-bundle sketch after this runbook). On incidents, your first triage step becomes “verify the bundle, confirm which limits were in effect, and identify where overflow occurred.”
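The two hashing steps in the runbook (policy hash as the change control boundary, CST pinned to that hash) can be sketched in a few lines. This is a minimal illustration, assuming canonical JSON plus SHA-256; the field names and the real WPC/CST schemas belong to Claw EA and may differ.

```python
# Minimal sketch of policy-hash pinning, assuming canonical JSON + SHA-256.
# Field names (job_id, scope_hash, policy_hash) are illustrative, not the
# actual Claw EA WPC/CST schema.
import hashlib
import json


def policy_hash(wpc: dict) -> str:
    """Hash a WPC over a canonical JSON serialization; this hash is the change control boundary."""
    canonical = json.dumps(wpc, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_cst_claims(job_id: str, allowed_actions: list[str], wpc: dict) -> dict:
    """Assemble the claims a job-scoped CST would carry before it is minted."""
    scope_hash = hashlib.sha256("\n".join(sorted(allowed_actions)).encode("utf-8")).hexdigest()
    return {
        "job_id": job_id,                 # job-scoped binding: the same token replayed in another job is rejected
        "scope_hash": scope_hash,         # which actions the job is allowed to perform
        "policy_hash": policy_hash(wpc),  # optional pin: the policy cannot drift mid-run
    }


wpc = {"wpc_version": "v1", "policy_name": "agent-rate-and-concurrency-guardrails"}
print(build_cst_claims("job-7f3a", ["model.call", "tool.search"], wpc))
```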
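For the “back off and stop” rule, the behavior that matters is a small, bounded retry budget that honors Retry-After and then raises instead of looping. A minimal sketch, assuming a plain HTTP POST to the clawproxy route (the URL and retry budget are placeholders):

```python
# Minimal sketch of "back off and stop" on 429, using only the standard library.
# The clawproxy URL and the retry budget are placeholder assumptions.
import time
import urllib.error
import urllib.request

MAX_RETRIES = 3  # small, bounded budget; never retry forever


class ThrottledError(RuntimeError):
    """Raised when the gateway keeps rejecting calls; the job should stop, not loop."""


def call_model(url: str, payload: bytes) -> bytes:
    for attempt in range(MAX_RETRIES + 1):
        req = urllib.request.Request(url, data=payload, method="POST")
        try:
            with urllib.request.urlopen(req) as resp:
                return resp.read()
        except urllib.error.HTTPError as err:
            if err.code != 429:
                raise  # non-throttle errors surface immediately
            if attempt == MAX_RETRIES:
                # Hard stop: surface the throttle to the job instead of retrying forever.
                raise ThrottledError("gateway throttled the job; stopping") from err
            # Honor Retry-After if the gateway sends it, otherwise back off exponentially.
            time.sleep(float(err.headers.get("Retry-After", 2 ** attempt)))
    raise ThrottledError("retry budget exhausted")  # defensive; the loop returns or raises first
```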
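And a sketch of the bundle step: package the receipts, the CST scope hash, and the pinned WPC hash so the artifact published to Trust Pulse is self-describing. The bundle fields and digest scheme are assumptions for illustration; the claims dict reuses the shape from the CST sketch above.

```python
# Minimal sketch of assembling a per-job proof bundle. The receipt format is
# whatever clawproxy emits; the bundle fields and digest scheme here are
# illustrative, not the Claw EA artifact format.
import hashlib
import json
import time


def bundle_job(job_id: str, agent_id: str, session_id: str,
               receipts: list[dict], cst_claims: dict) -> dict:
    """Package receipts and job metadata so an auditor can verify them later."""
    body = {
        "job_id": job_id,
        "agent_id": agent_id,
        "session_id": session_id,
        "closed_at": int(time.time()),
        "cst_scope_hash": cst_claims["scope_hash"],
        "pinned_wpc_hash": cst_claims.get("policy_hash"),
        "receipts": receipts,  # gateway receipts collected from clawproxy during the run
    }
    digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()
    return {"bundle": body, "bundle_sha256": digest}
```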
Threat model
Rate and concurrency controls are not just about cost. They are also about stabilizing execution when the agent faces adversarial inputs, unexpected fan-out, or tool feedback loops.
| Threat | What happens | Control |
|---|---|---|
| Retry storm from transient 5xx or timeouts | Agent creates a self-inflicted DDoS against your model route or proxy, starving other workloads and spiking spend. | Concurrency caps plus fail-fast overflow; enforce at execution boundary via WPC plus CST binding. |
| Prompt injection causes fan-out | A single inbound message triggers many tool calls (search, browse, summarize), multiplying model calls per user action. | Per-job or per-session concurrency and per-minute rate limits; keep OpenClaw tool policy tight so the agent cannot add new tools to “work around” the limit. |
| Runaway parallelism in multi-agent workflows | Planner spawns sub-agents that all call models concurrently, saturating quotas and causing cascading failures. | Tenant-level and job-level in-flight caps; require a distinct CST per job so counters cannot be bypassed by spawning new processes. |
| Noisy neighbor across teams | One team’s agent load affects others and makes latency unpredictable. | Scope limits per org/team baked into WPC, and enforced at the proxy edge with deterministic overflow behavior. |
| “Prompt-only policy” bypass | The model is instructed to ignore limits, or a different prompt template removes the reminder to throttle. | Permissioned execution: the proxy or runtime rejects excess calls regardless of the prompt, and the result is auditable through gateway receipts. |
Policy-as-code example
This is an intentionally small, JSON-like sketch of what teams usually encode in a WPC. The important part is that limits are machine-checked at the execution layer, and the job’s CST can optionally pin the WPC hash so the policy cannot drift mid-run.
```json
{
  "wpc_version": "v1",
  "policy_name": "agent-rate-and-concurrency-guardrails",
  "applies_to": {
    "agent_ids": ["support-triage", "invoice-recon"],
    "environments": ["prod"]
  },
  "limits": {
    "model_calls": {
      "max_in_flight_per_job": 4,
      "max_requests_per_minute_per_agent": 60,
      "overflow": {
        "mode": "fail_fast",
        "http_status": 429
      }
    }
  },
  "enforcement": {
    "require_cst": true,
    "cst_job_binding": true,
    "optional_policy_hash_pinning": true
  },
  "audit": {
    "require_gateway_receipts": true,
    "bundle_on_job_close": true
  }
}
```
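To make the semantics concrete, here is a minimal sketch of what enforcing these two limits at the execution boundary looks like: a per-job semaphore for in-flight calls and a per-agent sliding window for requests per minute, with overflow surfaced as a 429-style rejection. This illustrates the behavior the WPC describes, not clawproxy’s actual implementation.

```python
# Minimal sketch of enforcing the WPC above at the execution boundary.
# Overflow raises an error that the proxy edge would map to HTTP 429 (fail_fast).
import threading
import time
from collections import defaultdict, deque

MAX_IN_FLIGHT_PER_JOB = 4  # mirrors limits.model_calls.max_in_flight_per_job
MAX_RPM_PER_AGENT = 60     # mirrors limits.model_calls.max_requests_per_minute_per_agent

_in_flight = defaultdict(lambda: threading.BoundedSemaphore(MAX_IN_FLIGHT_PER_JOB))
_windows: dict[str, deque] = defaultdict(deque)  # agent_id -> timestamps of admitted calls
_lock = threading.Lock()


class Overflow(Exception):
    """Maps to HTTP 429 at the proxy edge (the WPC's fail_fast overflow mode)."""


def admit(job_id: str, agent_id: str) -> None:
    """Admit one model call or raise Overflow; call release() when the call finishes."""
    if not _in_flight[job_id].acquire(blocking=False):
        raise Overflow("too many in-flight model calls for this job")
    now = time.monotonic()
    with _lock:
        window = _windows[agent_id]
        while window and now - window[0] > 60:
            window.popleft()              # drop timestamps older than one minute
        if len(window) >= MAX_RPM_PER_AGENT:
            _in_flight[job_id].release()  # give the slot back before rejecting
            raise Overflow("per-minute rate limit exceeded for this agent")
        window.append(now)


def release(job_id: str) -> None:
    _in_flight[job_id].release()
```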
If you also run Microsoft-facing agents, apply the same pattern to outbound Microsoft Graph calls made through the official API, and limit concurrency per app registration. Pair that with Entra ID Conditional Access and PIM to govern who can update the WPC and who can mint CSTs, so policy edits do not become an untracked backdoor.
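A hedged sketch of that pattern for Graph calls: a per-app-registration concurrency cap, plus honoring Retry-After when Graph returns 429. The cap value and the wiring are assumptions; the access token comes from whatever Entra ID flow your app uses.

```python
# Minimal sketch of a per-app-registration concurrency cap for Microsoft Graph,
# honoring Retry-After on 429. The cap of 4 is an assumed example value.
import threading
import time
import urllib.error
import urllib.request

GRAPH_BASE = "https://graph.microsoft.com/v1.0"
_per_app_slots = threading.BoundedSemaphore(4)  # one pool per app registration


def graph_get(path: str, access_token: str) -> bytes:
    req = urllib.request.Request(
        f"{GRAPH_BASE}{path}",
        headers={"Authorization": f"Bearer {access_token}"},
    )
    with _per_app_slots:  # concurrency limit for this app registration
        try:
            with urllib.request.urlopen(req) as resp:
                return resp.read()
        except urllib.error.HTTPError as err:
            if err.code == 429:
                # Graph signals throttling with 429 and a Retry-After header; honor it,
                # then surface the error instead of retrying indefinitely.
                time.sleep(float(err.headers.get("Retry-After", 5)))
            raise
```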
What proof do you get?
For each model call routed through clawproxy you get a gateway receipt that can be verified later. Receipts include enough structured metadata to show which job made the call and when, so an auditor can see whether an overflow condition occurred and how often.
At the end of a run you can produce a proof bundle that packages the receipts plus job metadata, including the CST scope hash and any pinned WPC hash. For human review, you can store and view the resulting artifact in Trust Pulse, which is useful when you need to compare “policy intended” versus “policy actually enforced.”
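A minimal sketch of that “policy intended” versus “policy actually enforced” check, assuming the bundle shape from the assembly sketch earlier: recompute the digest, then compare the pinned WPC hash against the policy hash that was approved. Real verification in Claw EA may also involve signatures over receipts.

```python
# Minimal sketch of verifying a proof bundle produced by the assembly sketch above.
import hashlib
import json


def verify_bundle(artifact: dict, approved_wpc_hash: str) -> bool:
    body = artifact["bundle"]
    digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()
    if digest != artifact["bundle_sha256"]:
        return False  # the bundle was altered after the job closed
    if body.get("pinned_wpc_hash") and body["pinned_wpc_hash"] != approved_wpc_hash:
        return False  # the job ran under a different policy than the one approved
    return True
```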
Rollback posture
Rate and concurrency controls should be easy to tighten quickly and safe to loosen deliberately. Rollbacks should always preserve verifiability: you want to know which policy version was active for each job, even during an emergency change.
| Action | Safe rollback | Evidence |
|---|---|---|
| Tighten max in-flight (for example, from 8 to 4) | Publish a new WPC and issue new CSTs pinned to the new WPC hash; let existing jobs complete or stop them explicitly. | Proof bundle shows pinned policy hash; gateway receipts show reduced concurrency after cutover. |
| Loosen request rate temporarily | Time-box the policy change and require approval for CST issuance; avoid editing an existing WPC in place. | Two distinct WPC hashes in Trust Pulse with timestamps; job-scoped CST binding prevents replay across the window. |
| Disable throttling during incident triage | Prefer raising caps rather than removing enforcement entirely; keep receipts on so you can reconstruct impact later. | Gateway receipts remain continuous; proof bundles remain verifiable even if limits were higher. |
| Change agent behavior to reduce load | Adjust OpenClaw tool policy and skill prompts, but treat these as secondary controls that do not replace enforcement. | OpenClaw configuration change records plus the same proxy receipts demonstrate whether behavior actually changed. |
FAQ
What is the difference between rate limits and concurrency limits?
Rate limits cap how many requests happen over time (for example per minute). Concurrency limits cap how many requests can be in flight at the same moment, which is what prevents bursty parallel fan-out.
Why can’t we just instruct the agent to “go slower” in the prompt?
Prompts are not enforcement. Under retries, injection, or “helpful” behavior, the model will still attempt extra calls unless the execution layer rejects them.
Where should limits be enforced in an OpenClaw deployment?
Keep the agent’s local tool execution constrained using OpenClaw tool policy and sandbox settings, and enforce model-call limits at the proxy boundary. That keeps limits consistent across agents and produces uniform gateway receipts for audit.
How do WPC and CST work together for throttling?
The WPC defines the rules, and the CST is the job credential that can carry a scope hash and optionally pin a specific policy hash. That combination makes policy selection explicit and limits the blast radius of token reuse.
What does an auditor actually verify?
They verify gateway receipts and the proof bundle, then check that the job’s CST scope hash and pinned WPC hash match the policy that was approved. If throttling was triggered, the receipts show the timing and the overflow outcomes.