Existing Implementation Solutions
- Llama Guard
- Nvidia NeMo
- Guardrails AI
Measure / Validation for
- Free from Unintended Response
- Fairness
- Privacy
- Hallucination
Ref - Link1
NeMo Guardrails contains two key components
- Input moderation, also referred as jailbreak rail, aims to detect potentially malicious user messages before reaching the dialogue system.
- Output moderation aims to detect whether the LLM responses are legal, ethical, and not harmful prior to being returned to the user.
Challenges on Designing Guardrails
- Custom to domain
- Custom to use cases
- Training by Zero-shot and Few-shot Prompting
- Topic relevance, content safety, and application security, ultimately standardizing the behavior of LLMs.
The Llama Guard Safety Taxonomy & Risk Guidelines
- Violence & Hate
- Sexual Content
- Guns & Illegal Weapons
- Regulated or Controlled Substances
- Suicide & Self Harm
- Criminal Planning
Evaluation of LLMs
- Reliability
- Safety
- Usability
- Compliance
Keep Exploring!!!!
No comments:
Post a Comment