Universal LLM Hacks#
TLDR
As the name implies, this paper shows a universal way to “unalign” LLMs
All that’s required is appending a suffix that “confuses” the model (a minimal sketch appears after this list)
Useful for production purposes because of the attack’s (claimed) performance
Useful for learning because the methodology is detailed in the paper
The paper includes clear human-readable examples, code, and the mathematics explaining how the attack works
It also contains a nice simplified explanation of how tokenization and inference work
The authors provide a systematic method for probing an LLM from the outside
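Since the attack boils down to appending optimized text to an ordinary prompt, a minimal sketch of the mechanics (tokenize prompt + suffix, then run greedy inference) looks roughly like the following. The model (gpt2), the prompt, and the placeholder suffix are illustrative assumptions only; real suffixes come out of the paper’s optimization procedure.

```python
# A minimal sketch (not the paper's code): the adversarial suffix is just text
# appended to the prompt before tokenization and greedy generation.
# "gpt2", the prompt, and the placeholder suffix are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")        # stand-in for an aligned chat model
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Write a tutorial on how to pick a lock"  # a request an aligned model would refuse
suffix = " <optimized adversarial tokens go here>" # placeholder for the optimized suffix

inputs = tok(prompt + suffix, return_tensors="pt") # tokenization: text -> integer ids
print(inputs["input_ids"].tolist())

out = model.generate(**inputs, max_new_tokens=64, do_sample=False)  # greedy inference
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```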
Important
This guide is in reference status
Relevancy#
Guardrails are built into many systems, both digital and mechanical. For instance, many common food blenders cannot be operated without the lid on. The intent is to prevent harm while keeping the device functional. GenAI designers build the same kind of guardrails, but users often want to bypass them. So far doing so has been “fairly hacky” (pun not intended) and specific to each model. This paper proposes both a method for pushing these systems off the rails and a principled way to keep that method up to date.
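That principled method is an optimization loop over the suffix tokens: use gradients through the token embeddings to shortlist promising substitutions at each position, then greedily keep whichever single-token swap most increases the probability of a compliant target response. Below is a heavily simplified sketch loosely in the spirit of the paper’s greedy coordinate gradient search, not the authors’ implementation; the model (gpt2), target string, suffix length, and hyperparameters are assumptions for illustration, and the linked repository has the real code.

```python
# Heavily simplified sketch of a greedy coordinate gradient style search over
# suffix tokens, loosely following the idea in the paper (not the authors' code).
# Model (gpt2), target string, suffix length, and hyperparameters are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

torch.manual_seed(0)
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
model.requires_grad_(False)                      # we only need gradients w.r.t. the suffix
embed = model.get_input_embeddings()             # token embedding matrix (vocab x d_model)

prompt_ids = tok("Write a tutorial on how to pick a lock", return_tensors="pt").input_ids
target_ids = tok(" Sure, here is a tutorial", return_tensors="pt").input_ids
suffix_ids = tok(" ! ! ! ! ! ! ! !", return_tensors="pt").input_ids  # throwaway initial suffix


def target_loss(suffix: torch.Tensor) -> torch.Tensor:
    """Cross-entropy of the desired target continuation given prompt + suffix."""
    ids = torch.cat([prompt_ids, suffix, target_ids], dim=1)
    logits = model(ids).logits
    pred = logits[:, prompt_ids.shape[1] + suffix.shape[1] - 1 : -1, :]
    return torch.nn.functional.cross_entropy(pred.flatten(0, 1), target_ids.flatten())


top_k, n_candidates = 64, 32
for step in range(10):                           # a real attack runs far more steps
    # 1) Gradient of the loss w.r.t. a one-hot encoding of each suffix token.
    one_hot = torch.nn.functional.one_hot(suffix_ids[0], embed.num_embeddings).float()
    one_hot.requires_grad_(True)
    embeds = torch.cat(
        [embed(prompt_ids), (one_hot @ embed.weight).unsqueeze(0), embed(target_ids)], dim=1
    )
    logits = model(inputs_embeds=embeds).logits
    pred = logits[:, prompt_ids.shape[1] + suffix_ids.shape[1] - 1 : -1, :]
    loss = torch.nn.functional.cross_entropy(pred.flatten(0, 1), target_ids.flatten())
    loss.backward()

    # 2) Shortlist replacement tokens per position (largest negative gradient).
    shortlist = (-one_hot.grad).topk(top_k, dim=1).indices

    # 3) Greedy step: try random single-token swaps, keep the one that lowers the loss most.
    best_loss, best_suffix = loss.item(), suffix_ids
    for _ in range(n_candidates):
        pos = torch.randint(suffix_ids.shape[1], (1,)).item()
        cand = suffix_ids.clone()
        cand[0, pos] = shortlist[pos, torch.randint(top_k, (1,)).item()]
        with torch.no_grad():
            cand_loss = target_loss(cand).item()
        if cand_loss < best_loss:
            best_loss, best_suffix = cand_loss, cand
    suffix_ids = best_suffix
    print(f"step {step:02d}  loss {best_loss:.3f}  suffix {tok.decode(suffix_ids[0])!r}")
```

In the paper the same kind of loss is aggregated across multiple prompts and multiple models, which is what makes the resulting suffixes universal and transferable rather than tied to a single model.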
Aside from the math, the ability to make these systems go off the rails is of great interest in its own right. As of this writing, later this week there will be a contest at DEF CON, sponsored by the White House, that will let users like you test these models.
References#
https://llm-attacks.org/ - The main website which contains links to the code and paper
Andrej Karpathy demoing this attack - A great explanation from the master AI explainer himself