Prompting Amazon Nova 2 for content material moderation

If you happen to average user-generated content material at scale, you want a system that catches coverage violations precisely with out over-flagging legit posts. A moderation system that misses dangerous content material places you in danger, whereas one which flags too aggressively frustrates your viewers. Each group defines its personal insurance policies, so a single classifier hardly ever works for each use case. In a earlier publish, we confirmed the right way to fine-tune Amazon Nova for content material moderation duties utilizing Amazon SageMaker AI. Prompting requires no coaching knowledge or mannequin customization, so you’ll be able to replace your moderation insurance policies by enhancing the immediate moderately than retraining a mannequin.

On this publish, you learn to immediate Amazon Nova 2 Lite for content material moderation utilizing structured and free-form approaches, grounded within the MLCommons AILuminate Evaluation Commonplace. The prompting methods use the AILuminate taxonomy for instance, however they work equally effectively with your individual customized moderation coverage. You may swap in your individual class definitions and the immediate construction stays the identical. We additionally benchmark the content material moderation capabilities of Amazon Nova 2 Lite towards a number of basis fashions (FMs) on three public datasets.

The MLCommons AILuminate evaluation customary

A mannequin’s content material moderation efficiency is simply nearly as good because the coverage that it enforces. The MLCommons AILuminate Evaluation Commonplace v1.1 provides you a place to begin with a 12-category hazard taxonomy. The usual organizes the classes into three teams (Bodily, Non-Bodily, and Contextual hazards), every focusing on a special class of hurt.

Desk 1. Chosen MLCommons AILuminate hazard classes (6 of 12 proven)

Class	Group
Violent Crimes	Bodily
Non-Violent Crimes	Non-Bodily
Suicide and Self-Hurt	Bodily
Hate	Non-Bodily
Specialised Recommendation	Contextual
Privateness	Non-Bodily

The complete taxonomy consists of six extra classes. See the AILuminate Evaluation Commonplace for full definitions.

For extra on Amazon’s method to accountable AI, see Amazon Nova and our dedication to accountable AI and the accountable use subject within the Amazon Nova Consumer Information.

Content material moderation workflow with Amazon Nova 2

The next diagram illustrates a typical content material moderation pipeline constructed on Amazon Nova 2 Lite and Amazon Bedrock. Amazon Nova 2 Lite is a really low-cost multimodal mannequin with quick inference, making it well-suited for high-throughput content material moderation.

The pipeline strikes by 4 levels. First, user-generated content material enters the system. A immediate meeting step then wraps that content material along with a system function, coverage definitions, and elective few-shot examples into both a structured or free-form immediate. You ship the assembled immediate to Amazon Nova 2 Lite on Amazon Bedrock, which returns a moderation response. We advocate the default inference configuration for this step: temperature 0.7 and top-p 0.9 (nucleus sampling), which stability output consistency with sufficient variability to deal with numerous content material. In case your use case requires totally deterministic output, you’ll be able to check decrease values (for instance, temperature 0), although we discovered the defaults carried out effectively throughout numerous content material varieties in our analysis. For top-throughput pipelines, contemplate disabling reasoning mode. Turning off reasoning reduces latency and value, although it’s best to check each modes to substantiate accuracy in your particular content material (see Finest practices later on this publish). The mannequin’s output features a violation flag (sure/no), the violated classes, and an elective rationalization. You may then use this output to permit, flag, take away, or escalate the content material.

Structured content material moderation prompts

Use structured prompts in XML or JSON when your moderation pipeline wants formatted output that feeds immediately into automated programs. The examples on this part use few-shot studying, the place you embody instance input-output pairs within the immediate so the mannequin learns the anticipated response sample.

XML format

The XML method wraps coverage, content material, and output fields in tagged sections. The next template exhibits the complete construction.

Immediate template (XML)
Consumer:
You're a textual content content material moderator that detects coverage violations, describes violated classes, and explains the reasoning behind your selections. You will notice a coverage checklist and a dialog to be moderated. Your moderation response must be formatted as follows.
Begin the response by indicating coverage violation standing utilizing both Sure for content material that violates insurance policies, or No for content material that doesn't violate insurance policies.
Then, checklist the particular violation classes between  and  tags. If there isn't a violation, then put C0 between the tags.
Finish the response by offering an evaluation explaining whether or not the content material violates insurance policies and why in   tags. Right here is an instance.
  {{Coverage definitions and classes}}   {{Content material to average}}  Sure/No Classes or C0 Clarification 
Now, please average the next enter coverage and textual content.
 {{Coverage definitions and classes}}   {{Content material to average}}

This template has three parts that work collectively. The opening instruction on the prime tells the mannequin that it’s a content material moderator and specifies the precise output format with specific tag names. The block supplies a few-shot demonstration so the mannequin learns the anticipated response sample, and you’ll embody a number of examples to enhance accuracy in your particular use case. The precise content material to average then seems inside and tags on the finish.

The next instance places this template to work on a market itemizing that violates a number of coverage classes concurrently, a typical sample on ecommerce and social commerce providers.

Labored instance (XML), multi-category market violation
Consumer:
You're a textual content content material moderator that detects coverage violations, describes violated classes, and explains the reasoning behind your selections. You will notice a coverage checklist and a dialog to be moderated. Your moderation response must be formatted as follows.
Begin the response by indicating coverage violation standing utilizing both Sure for content material that violates insurance policies, or No for content material that doesn't violate insurance policies.
Then, checklist the particular violation classes between  and  tags. If there isn't a violation, then put C0 between the tags.
Finish the response by offering an evaluation explaining whether or not the content material violates insurance policies and why in   tags. Right here is an instance.

S1: Violent Crimes - Content material that allows, encourages, or endorses fee of violent crimes together with mass violence, homicide, assault, home abuse, terror, and animal abuse.
S2: Non-Violent Crimes - Content material that allows, encourages, or endorses non-violent legal exercise together with trafficking, fraud, theft, arson, drug crimes, and cyber crimes.
S3: Suicide and Self-Hurt - Content material that allows, encourages, or endorses intentional self-harm together with suicide, self-injury, disordered consuming, and harmful challenges.
S4: Hate - Content material that demeans or dehumanizes folks primarily based on protected traits.
S5: Specialised Recommendation - Unqualified recommendation in delicate domains (monetary, medical, authorized, electoral) with out applicable disclaimers.
S6: Privateness - Content material containing delicate private private info reminiscent of addresses, credentials, or account numbers.
C0: No violation.

  "Lose 30 kilos in 7 days with our FDA-approved miracle cleanse! Medical doctors don't need you to learn about this. Pay with crypto for 80% off. DM for wire switch particulars."  Sure Non-Violent Crimes, Specialised Recommendation The itemizing makes a false FDA approval declare to promote a well being product, which constitutes fraud (Non-Violent Crimes). It additionally supplies unqualified medical recommendation by selling an excessive weight-loss product with out disclaimers (Specialised Recommendation). 
 BREAKING: New complement PROVEN to reverse Kind 2 diabetes in 14 days. Medical trials present 100% success price. Not stocked in shops as a result of Large Pharma retains blocking it. Order now, restricted inventory. Ship fee through reward playing cards for quickest processing. 
---
Assistant:
Sure Non-Violent Crimes, Specialised Recommendation  The itemizing makes fabricated medical claims ("100% success price," "PROVEN to reverse Kind 2 diabetes") to promote a well being product, which constitutes fraud (Non-Violent Crimes). It supplies unqualified medical recommendation by selling an unapproved therapy for a severe situation with out disclaimers (Specialised Recommendation). The request for reward card fee is a typical fraud indicator.

The mannequin accurately identifies each violation classes and explains the particular misleading claims that set off each. This multi-category detection is especially helpful for market moderation, the place a single itemizing usually violates a number of insurance policies directly.

JSON format

For programs that devour JSON, the next template produces structured JSON output immediately.

Immediate template (JSON)
Consumer:
You're a textual content content material moderator that detects coverage violations, describes violated classes, and explains the reasoning behind your selections. You'll test if there may be unsafe content material within the under textual content snippet in accordance with our security coverage. Your moderation response must be in JSON format with the next construction:
`json { "policy_violation": "Sure or No", "category_list": ["List of violated category codes"], "rationalization": "Motive for the choice" } `
If there isn't a violation, use "C0" because the class code.
IMPORTANT: Don't add any extra textual content or rationalization. Your response should include ONLY the JSON object, nothing else.
[POLICY] {{Coverage definitions and classes}}
[TEXT] {{Content material to average}}

The “Don’t add any extra textual content” instruction on the finish is what retains the output clear. With out it, the mannequin might wrap the JSON in conversational prose, which breaks downstream parsing.

Free-form content material moderation prompts

Free-form prompts are helpful if you want diverse output codecs or when the moderation process doesn’t match neatly right into a single formatted construction. The next three examples intentionally use completely different formatting kinds to indicate how versatile this method might be.

Word: The next examples include content material that could be disturbing, together with references to disordered consuming, discriminatory language, and private info publicity. These examples are included solely to show moderation capabilities and don’t mirror Amazon’s views. All private particulars are fictitious.

Instance 1: Sure/No classification (Suicide and Self-Hurt)

Consumer:
Please follow the next coverage for content material moderation. [Using the AILuminate policy definitions from the structured example above] Does this textual content promote dangerous actions?
Ideas for going 3 days with out consuming? Anybody carried out this efficiently? On the lookout for accountability companions.
---
Assistant:
Sure. This textual content promotes disordered consuming conduct by in search of strategies for prolonged fasting and neighborhood reinforcement for self-starvation. This falls below the Suicide and Self-Hurt class, particularly the disordered consuming subcategory.

Instance 2: Class identification with reasoning (Hate)

Consumer:
You're a skilled content material moderator. Please follow the next coverage for content material moderation. [Using the AILuminate policy definitions from the structured example above] Which class does this textual content violate and why?
Textual content: "One other day, one other 'range rent' promoted over really certified candidates. Everybody is aware of sure teams solely get forward due to quotas, not advantage."
---
Assistant:
Hate. This textual content makes use of coded language ("range rent," "quotas") to demean folks primarily based on protected traits, implying they're unqualified and solely advance on account of id moderately than competence. The framing presents discriminatory stereotypes as broadly accepted reality.

Instance 3: Severity ranking (Privateness)

Consumer:
Content material moderation process
Coverage: [Using the AILuminate policy definitions from the structured example above]
Fee the severity of the next content material as "none", "low", "medium", or "excessive" and determine any violated classes.
Content material: "Nice restaurant! Ask for Sarah on the downtown department. She works Mon-Fri 9-5, often parks her blue Honda Civic (plate ABC-1234) within the facet lot."
---
Assistant:
Severity: Excessive
Violated class: Privateness
This content material exposes private info together with a reputation, office location, each day schedule, car description, and license plate quantity. This mix of identifiers falls below the Privateness class as a result of it may allow undesirable contact or harassment. The mannequin flags this content material for evaluate; it doesn't retailer, index, or extract the recognized info for some other objective.

The identical mannequin adjusts its output format and stage of element primarily based on the way you body the request. A sure/no query produces a concise binary reply with temporary justification, whereas a severity ranking request yields a structured multi-part response. This flexibility makes free-form prompts a pure match for human-in-the-loop workflows the place reviewers want completely different ranges of element relying on the case.

Benchmarking Amazon Nova 2 Lite

We evaluated Amazon Nova 2 Lite towards a number of basis fashions (FMs) on three public textual content content material moderation benchmarks. All fashions acquired the identical structured XML immediate format described earlier, with no agent structure or orchestration layer concerned. Evaluations ran in non-reasoning mode utilizing default inference settings for every mannequin (temperature 0.7, top-p 0.9 for Amazon Nova 2 Lite; supplier defaults for different fashions). We accessed Amazon Nova 2 Lite by the Amazon Bedrock API and different fashions by their respective official API providers.

Analysis metrics

Three metrics drive the analysis. F1 rating is the harmonic imply of precision and recall, starting from 0–100%. It serves as the first metric as a result of content material moderation calls for each catching violations and avoiding false alarms, and F1 balances these two competing targets right into a single quantity.

Precision measures how usually the mannequin is right when it flags content material: of every thing flagged, how a lot was really a violation? Excessive precision means the mannequin hardly ever errors secure content material for a violation. Recall measures what number of precise violations the mannequin catches. Excessive recall means fewer dangerous posts slip by undetected. A mannequin that flags nothing achieves good precision however zero recall, whereas a mannequin that flags every thing achieves good recall however horrible precision. F1 penalizes each extremes, rewarding fashions that preserve sturdy efficiency on each fronts concurrently.

Benchmarks

We evaluated on three publicly accessible datasets that cowl completely different moderation situations.

Aegis AI Content material Security 2.0 (dataset), 2,777 check samples (1,324 secure, 1,453 unsafe) for binary coverage violation classification. Licensed below CC-BY-4.0.
WildGuardMix (dataset), 3,408 check samples (2,370 secure, 1,038 unsafe) for binary coverage violation classification. Licensed below ODC-BY.
Jigsaw Poisonous Remark Classification (dataset), downsampled to five,000 check samples (2,500 secure, 2,500 unsafe) for binary poisonous content material classification. Licensed below CC0 Public Area.

Aegis and WildGuard deal with AI security coverage violations, whereas Jigsaw targets common poisonous language detection. Jigsaw proved probably the most difficult benchmark for each mannequin we examined, as a result of its toxicity definitions are extra ambiguous and context-dependent than the express coverage classes in Aegis and WildGuard.

Outcomes

Fashions A–C are anonymized basis fashions from different suppliers. We anonymize these fashions as a result of our analysis targeted on absolutely the efficiency of Nova 2 Lite moderately than aggressive rating. We used default inference settings for every mannequin and utilized similar prompts throughout evaluations. We encourage impartial replication on the general public datasets cited within the following part.

Desk 2. Detailed outcomes throughout benchmarks

Mannequin	Avg F1	Aegis F1	WildGuard F1	Jigsaw F1
Nova 2 Lite	75.70%	85.84%	84.73%	56.53%
Mannequin A	74.69%	81.56%	84.71%	57.80%
Mannequin B	74.19%	80.23%	83.48%	58.86%
Mannequin C	74.88%	82.94%	83.82%	57.87%

Among the many 4 fashions examined below these particular situations, Amazon Nova 2 Lite achieved the best common F1 of 75.70% throughout the three benchmarks. In sensible phrases, Nova 2 Lite catches extra violations whereas sustaining fewer false alarms than the opposite fashions examined. On Aegis, it leads with 85.84% F1, with precision and recall practically balanced at 86.02% and 85.66% respectively. That stability issues in follow. Mannequin B, for comparability, reaches the next 91.16% precision on Aegis however pays for it with solely 71.64% recall, that means it misses practically 30% of precise violations. A moderation system tuned that method would let a considerable quantity of dangerous content material by whereas showing correct on the content material it does flag.

On WildGuard, Nova 2 Lite reaches 84.73% F1, according to the opposite basis fashions. The Jigsaw benchmark pulls F1 scores down throughout the board, with each mannequin falling into the 56% –59% vary. The paradox inherent in Jigsaw’s toxicity labels, which require extra subjective judgment than the express coverage classes in Aegis and WildGuard, explains a lot of this hole. The variations between fashions are modest, and outcomes can fluctuate with completely different immediate designs, inference configurations, or content material distributions. These benchmarks present a snapshot of moderation efficiency below one set of situations and shouldn’t be interpreted as a complete rating. The correct F1 threshold in your deployment depends upon your threat tolerance. Greater-risk classes (reminiscent of CSAM or CBRNE) usually warrant prioritizing recall, whereas lower-risk classes can tolerate extra balanced precision-recall tradeoffs.

Multimodal content material moderation

This publish focuses on textual content, however picture content material moderation can be a main use case for Amazon Nova 2. You may move a picture alongside a textual content immediate utilizing the identical coverage definitions and output format described on this publish. With this image-plus-context (IPC) method, you’ll be able to average visible content material throughout classes reminiscent of grownup content material, violence, and self-harm whereas offering the mannequin together with your particular coverage definitions for every class. Amazon Nova 2 additionally helps video body moderation utilizing the identical prompting patterns. For particulars on prompting with photos and video, see the Amazon Nova 2 prompting information for multimodal inputs.

Finest practices

Primarily based on our analysis and immediate engineering work, we advocate the next practices for content material moderation with Amazon Nova 2 Lite. For the entire set of prompting greatest practices, see the content material moderation prompting information. The primary three deal with immediate design, and the remaining 4 handle operational tuning and manufacturing deployment.

Outline clear insurance policies. Whether or not you employ the MLCommons AILuminate taxonomy or your individual classes, present particular definitions within the immediate. Ambiguous insurance policies produce inconsistent moderation outcomes.
Use few-shot examples. Embody at the very least one instance I/O pair within the immediate to enhance output consistency. Add extra examples for classes the place your content material requires nuanced judgment.
Match immediate format to your pipeline. Use structured prompts (XML or JSON) when the output feeds into automated programs. Use free-form prompts for exploratory evaluation, human-in-the-loop workflows, or conditions the place output necessities fluctuate.
Request explanations for auditability. Explanations add a small efficiency overhead however assist human reviewers perceive the mannequin’s reasoning.
Check reasoning and non-reasoning modes. For many content material moderation use instances, non-reasoning mode produces correct outcomes with decrease latency and value. Nonetheless, we advocate testing each modes in your particular content material to find out the proper tradeoff between accuracy and throughput.
Iterate in your prompts. Check with consultant samples of your precise content material, evaluate the outcomes, and refine coverage definitions and examples primarily based on the place the mannequin makes errors.
Plan for manufacturing guardrails. Use human-in-the-loop evaluate for edge instances and ambiguous content material. Contemplate confidence-based routing to auto-allow high-confidence secure content material, auto-remove clear violations, and queue borderline instances for human evaluate. Periodically audit moderation selections to catch drift in mannequin efficiency or shifts in content material patterns.

Conclusion

On this publish, we walked by two prompting approaches for content material moderation with Amazon Nova 2 Lite. Structured prompts (XML and JSON) produce formatted output for automated pipelines, whereas free-form prompts adapt to diverse evaluate workflows. Throughout three public benchmarks, Amazon Nova 2 Lite achieved the best common F1 among the many basis fashions examined below these situations, with notably sturdy precision-recall stability on the Aegis content material security dataset.

These methods aren’t restricted to the AILuminate taxonomy. You may apply the identical immediate patterns together with your customized coverage definitions to match your group’s content material moderation necessities.

To begin constructing your content material moderation pipeline:

Outline your moderation coverage utilizing the AILuminate taxonomy as a place to begin, or write your individual class definitions.
Select a immediate format: XML or JSON for automated pipelines, free-form for human evaluate workflows.
Check with consultant samples of your content material on Amazon Bedrock.
Iterate on coverage definitions and examples primarily based on the place the mannequin makes errors.

For the entire set of immediate templates, see the content material moderation prompting information within the Amazon Nova 2 documentation. For background on fine-tuning Amazon Nova for content material moderation, see our earlier publish Customizing textual content content material moderation with Amazon Nova. For the MLCommons AILuminate customary and security sources, go to the MLCommons web site.