Comparing LLM guardrails across the major platforms

Every platform now ships some kind of guardrail layer. They are not interchangeable, and the gap between marketing copy and production behavior is wide.

By ForthClover Engineering · February 2026 · 9 min read · Security

Every major LLM platform now ships some kind of guardrail layer — Bedrock Guardrails, Azure AI Content Safety, OpenAI's moderation endpoint and built-in policies. The marketing pages make them sound interchangeable. They are not, and the gap between what the docs claim and what actually triggers on production traffic is wide enough that almost every team we audit is relying on at least one protection that is not actually there.

This is the practical view of where each platform's guardrails actually help, where they still leave you exposed, and where you have to build your own layer regardless of which provider you picked.

What guardrails actually do

It helps to be precise. Most "guardrails" today are doing one or more of these jobs:

  • Input filtering. Blocking or sanitising prompts that match a category (violence, sexual content, self-harm, hate). Most are tunable severity classifiers.
  • Output filtering. Doing the same on the model's response before it reaches the user.
  • PII detection and redaction. Identifying names, emails, account numbers, sometimes regulated identifiers like SSNs or PHI patterns.
  • Topic and word filters. Blocking conversations on subjects you have decided are off-limits (your competitors, legal advice, medical claims).
  • Grounding / contextual checks. Trying to detect when the model has gone off the source documents you supplied. The newest and least mature category.

Every platform implements some subset. None of them — yet — implements all of these well enough that you can stop thinking about safety as your problem.

Defense in depth

Three layers every guardrail design needs

Marketing copy usually focuses on the model layer. The input and output layers are where most production failures actually get caught.

[Diagram] Request flow from user to user through three layers:

  • Input layer (pre-flight checks): prompt-injection scan, PII redaction, rate/quota limits, topic allow-list
  • Model layer (provider safety): built-in safety filter, system-prompt guard, tool-use allow-list, token/cost cap
  • Output layer (post-flight checks): schema/JSON validation, content moderation, citation grounding, hallucination check

Failures in any one layer should be logged and surfaced, not silently dropped.

Bedrock Guardrails

The most fully-featured of the three. You configure a guardrail object once and apply it to any model on Bedrock (or any external endpoint via the standalone ApplyGuardrail API). It covers content categories with tunable severity, denied topics defined in natural language, regex-based word filters, PII detection with a healthy default catalogue, and the most usable contextual grounding check on the market.
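As a rough sketch of what the standalone path looks like in practice, here is a check through boto3's apply_guardrail call. The guardrail ID and version are placeholders, and the exact response fields may differ between SDK versions, so treat this as illustrative rather than authoritative.

```python
import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

def check_with_guardrail(text: str, source: str = "INPUT") -> bool:
    """Return True if the guardrail lets the text through unchanged."""
    response = bedrock_runtime.apply_guardrail(
        guardrailIdentifier="YOUR_GUARDRAIL_ID",  # placeholder
        guardrailVersion="1",                     # placeholder
        source=source,                            # "INPUT" or "OUTPUT"
        content=[{"text": {"text": text}}],
    )
    if response["action"] == "GUARDRAIL_INTERVENED":
        # The assessments list says which policy fired (content category,
        # denied topic, word filter, PII, or grounding).
        print(response["assessments"])
        return False
    return True
```

The same function works for output checks by passing source="OUTPUT", which is also where the streaming latency cost described below comes from.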

Where it's strong: the contextual grounding feature actually catches a meaningful share of RAG ungroundedness when you tune the threshold. The PII coverage is broad and handles common edge cases. Configuration as code (CDK, Terraform) means it lives in the same review process as the rest of your infra.

Where it's weak: latency. Each guardrail call adds meaningful round-trip time, and on a streaming response you're paying that cost twice (input and output). Also, word filters and denied topics are still gameable by determined adversaries — they reduce noise but they are not a real defence against a motivated jailbreaker.

Azure AI Content Safety

Azure's offering is the most enterprise-paperwork-ready of the three. The four core categories (hate, sexual, violence, self-harm) come with severity levels you can map onto your own policy. Prompt Shields handles direct and indirect prompt injection detection — a category most teams underestimate until they get a vendor pen-test report. Groundedness detection exists and is improving but is not yet as configurable as Bedrock's.
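A minimal sketch of what mapping those severity levels onto your own policy can look like with the azure-ai-contentsafety Python SDK. The endpoint, key, and thresholds are placeholders, and response field names can vary slightly between SDK versions.

```python
from azure.ai.contentsafety import ContentSafetyClient
from azure.ai.contentsafety.models import AnalyzeTextOptions
from azure.core.credentials import AzureKeyCredential

client = ContentSafetyClient(
    endpoint="https://<your-resource>.cognitiveservices.azure.com",  # placeholder
    credential=AzureKeyCredential("<your-key>"),                     # placeholder
)

# Example policy: the highest severity (0-7 scale) we tolerate per category.
SEVERITY_THRESHOLDS = {"Hate": 2, "Sexual": 2, "Violence": 4, "SelfHarm": 0}

def violates_policy(text: str) -> bool:
    result = client.analyze_text(AnalyzeTextOptions(text=text))
    for item in result.categories_analysis:
        limit = SEVERITY_THRESHOLDS.get(str(item.category), 0)
        if item.severity is not None and item.severity > limit:
            return True
    return False
```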

Where it's strong: prompt injection detection is the best of the three by a clear margin in our internal red teaming. The integration with Azure OpenAI is seamless and the latency budget is well-managed. The compliance story is unambiguous if your team already runs on Azure.

Where it's weak: PII coverage is comparatively narrow — it focuses on the obvious categories and leaves a lot of domain-specific identifiers (medical record numbers, bank account formats) for you to handle yourself. Custom categories are limited compared to Bedrock's denied topics.

OpenAI's built-in policies and moderation API

OpenAI ships two things that are easy to confuse: the model-level policies (the model itself refuses to produce certain content) and the standalone moderation endpoint (omni-moderation-latest) that classifies arbitrary text into safety categories.

The model-level policies are good but invisible — you can't tune them, you only see them when the model refuses. That's fine for most consumer use cases and infuriating for any legitimate enterprise workflow that touches sensitive subject matter (clinical, legal, financial). The moderation API is fast, cheap, and well-calibrated, but it's a classifier, not a policy engine — you still have to wire it into your own request and response paths and decide what to do when it fires.
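For illustration, the wiring with the official Python SDK is roughly the following. What to do when a request is flagged is deliberately left open, because that decision is the part the API does not make for you.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def is_flagged(text: str) -> bool:
    """Return True if any moderation category fires on the text."""
    result = client.moderations.create(
        model="omni-moderation-latest",
        input=text,
    )
    outcome = result.results[0]
    if outcome.flagged:
        # category_scores shows which categories fired and how strongly;
        # blocking, rewriting, or escalating is still your policy decision.
        print(outcome.category_scores)
    return outcome.flagged
```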

Where it's strong: moderation calibration is excellent and the latency is the best of the three. Nothing else to configure, which is also part of the appeal for fast-moving teams.

Where it's weak: no PII detection, no grounding check, no denied-topics configuration. Everything beyond the basic safety categories is still your problem.

Where every platform leaves you exposed

Three categories are still squarely your responsibility on all three platforms, and we treat them as required scope on every production engagement:

  • Domain-specific PII. Anything beyond the obvious — medical record numbers, internal customer IDs, partner account formats. Build a regex or NER layer yourself; do not assume the platform caught it (a minimal sketch follows this list).
  • Tool-call abuse. Guardrails almost never inspect tool arguments. If your agent has a tool that can cost money or do something destructive, you need authorisation logic on the tool itself, not on the model.
  • Output structure violation. Guardrails check content. They do not check whether the model returned the JSON shape your downstream code requires. Use structured outputs and validate every response against a schema, as in the sketch below.
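To make the first and third points concrete, here is a minimal sketch of the home-grown layer we mean. The PII patterns and the response schema are hypothetical stand-ins for whatever your domain actually requires.

```python
import re
from pydantic import BaseModel, ValidationError

# Hypothetical domain-specific identifiers that platform filters won't know about.
DOMAIN_PII_PATTERNS = {
    "medical_record_number": re.compile(r"\bMRN-\d{7}\b"),
    "internal_customer_id": re.compile(r"\bCUST-[A-Z]{2}\d{6}\b"),
}

def redact_domain_pii(text: str) -> str:
    """Replace every domain-specific identifier with a labelled placeholder."""
    for label, pattern in DOMAIN_PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

# The JSON shape downstream code expects from the model (hypothetical).
class SupportReply(BaseModel):
    answer: str
    citations: list[str]
    escalate: bool

def parse_model_output(raw_json: str) -> SupportReply | None:
    """Return the validated reply, or None if the shape is wrong."""
    try:
        return SupportReply.model_validate_json(raw_json)
    except ValidationError:
        # Treat a shape violation as a model failure, regardless of content.
        return None
```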

How we typically combine them

For most production engagements we end up with a layered model that does not depend on any single provider (sketched in code after the list):

  1. A first-pass classifier on every inbound request — usually the platform's own — to drop obviously hostile traffic cheaply.
  2. A schema-validated response from every model call. If the structured output fails to parse, the request is treated as a model failure regardless of content.
  3. A second-pass content check on the model output before it reaches the user, often using the same platform guardrail in output mode.
  4. Domain-specific filters for PII and policy that the platform doesn't cover, written by us and version-controlled like any other code.
  5. Logging of every guardrail trigger to the same trace store you use for the rest of your agent telemetry, so the security team can answer questions after the fact.
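Wired together, the five steps look roughly like this. The sketch reuses the helpers from earlier in the post, and log_guardrail_event and call_model are hypothetical stand-ins for your own telemetry sink and provider client.

```python
def handle_request(user_text: str) -> str:
    # 1. Cheap first-pass classifier on the inbound request.
    if is_flagged(user_text):
        log_guardrail_event("input_moderation", user_text)
        return "Sorry, I can't help with that."

    # 4. Domain-specific filters the platform doesn't cover.
    safe_text = redact_domain_pii(user_text)

    # Provider call, with a structured output requested (stand-in).
    raw = call_model(safe_text)

    # 2. Schema validation: a parse failure is a model failure.
    reply = parse_model_output(raw)
    if reply is None:
        log_guardrail_event("schema_violation", raw)
        return "Sorry, something went wrong. Please try again."

    # 3. Second-pass content check on the output before it reaches the user.
    if not check_with_guardrail(reply.answer, source="OUTPUT"):
        log_guardrail_event("output_guardrail", reply.answer)
        return "Sorry, I can't share that response."

    # 5. Every trigger above was logged to the same trace store via
    #    log_guardrail_event, so the security team can reconstruct it later.
    return reply.answer
```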

Pick the platform whose strengths line up with your biggest risk — Bedrock for the most rounded coverage, Azure for the best prompt-injection defence, OpenAI when you want the lightest-touch option — and assume you will still be writing some guardrail code yourself. The platforms are getting better, but none of them is yet the whole answer.