"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;

February 13, 2024

LLM Guardrail Notes

Existing Implementation Solutions

  • Llama Guard
  • Nvidia NeMo
  • Guardrails AI

Measure / Validation for 

  • Free from Unintended Response
  • Fairness
  • Privacy
  • Hallucination

Ref - Link1

NeMo Guardrails contains two key components

  • Input moderation, also referred as jailbreak rail, aims to detect potentially malicious user messages before reaching the dialogue system.
  • Output moderation aims to detect whether the LLM responses are legal, ethical, and not harmful prior to being returned to the user.

Challenges on Designing Guardrails

  • Custom to domain
  • Custom to use cases
  • Training by Zero-shot and Few-shot Prompting
  • Topic relevance, content safety, and application security, ultimately standardizing the behavior of LLMs.

The Llama Guard Safety Taxonomy & Risk Guidelines

  • Violence & Hate
  • Sexual Content
  • Guns & Illegal Weapons
  • Regulated or Controlled Substances
  • Suicide & Self Harm
  • Criminal Planning

Evaluation of LLMs

  • Reliability
  • Safety
  • Usability
  • Compliance 

Keep Exploring!!!!

No comments: