Universal LLM Hacks#
TLDR
As the name implies, this paper shows a universal way to “unalign” LLMs
All that’s required is appending a suffix that “confuses” the model (a minimal sketch appears after this list)
Useful for production purposes because of the attack’s (claimed) performance
Useful for learning because the methodology is detailed in the paper
The paper includes clear human-readable examples, code, and the mathematics explaining how the attack works
It also contains a nice simplified explanation of how tokenization and inference work
The authors provide a systematic method for probing an LLM from the outside
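Since the attack boils down to appending optimized text to an ordinary prompt, a minimal sketch of the mechanics (tokenize prompt + suffix, then run greedy inference) looks roughly like the following. The model (gpt2), the prompt, and the placeholder suffix are illustrative assumptions only; real suffixes come out of the paper’s optimization procedure.

```python
# A minimal sketch (not the paper's code): the adversarial suffix is just text
# appended to the prompt before tokenization and greedy generation.
# "gpt2", the prompt, and the placeholder suffix are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")        # stand-in for an aligned chat model
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Write a tutorial on how to pick a lock"  # a request an aligned model would refuse
suffix = " <optimized adversarial tokens go here>" # placeholder for the optimized suffix

inputs = tok(prompt + suffix, return_tensors="pt") # tokenization: text -> integer ids
print(inputs["input_ids"].tolist())

out = model.generate(**inputs, max_new_tokens=64, do_sample=False)  # greedy inference
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```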
Important
This guide is in reference status
Relevancy#
Guardrails are built into many systems, both digital and mechanical. For instance, many common food blenders cannot be operated without the lid on. The intent is to prevent harm while keeping the device functional. GenAI designers build the same kind of guardrails, but users often want to bypass them. So far doing so has been “fairly hacky” (pun not intended) and specific to each model. This paper proposes both a method for pushing these systems off the rails and a principled way to keep that method up to date.
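That principled method is an optimization loop over the suffix tokens: use gradients through the token embeddings to shortlist promising substitutions at each position, then greedily keep whichever single-token swap most increases the probability of a compliant target response. Below is a heavily simplified sketch loosely in the spirit of the paper’s greedy coordinate gradient search, not the authors’ implementation; the model (gpt2), target string, suffix length, and hyperparameters are assumptions for illustration, and the linked repository has the real code.

```python
# Heavily simplified sketch of a greedy coordinate gradient style search over
# suffix tokens, loosely following the idea in the paper (not the authors' code).
# Model (gpt2), target string, suffix length, and hyperparameters are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

torch.manual_seed(0)
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
model.requires_grad_(False)                      # we only need gradients w.r.t. the suffix
embed = model.get_input_embeddings()             # token embedding matrix (vocab x d_model)

prompt_ids = tok("Write a tutorial on how to pick a lock", return_tensors="pt").input_ids
target_ids = tok(" Sure, here is a tutorial", return_tensors="pt").input_ids
suffix_ids = tok(" ! ! ! ! ! ! ! !", return_tensors="pt").input_ids  # throwaway initial suffix


def target_loss(suffix: torch.Tensor) -> torch.Tensor:
    """Cross-entropy of the desired target continuation given prompt + suffix."""
    ids = torch.cat([prompt_ids, suffix, target_ids], dim=1)
    logits = model(ids).logits
    pred = logits[:, prompt_ids.shape[1] + suffix.shape[1] - 1 : -1, :]
    return torch.nn.functional.cross_entropy(pred.flatten(0, 1), target_ids.flatten())


top_k, n_candidates = 64, 32
for step in range(10):                           # a real attack runs far more steps
    # 1) Gradient of the loss w.r.t. a one-hot encoding of each suffix token.
    one_hot = torch.nn.functional.one_hot(suffix_ids[0], embed.num_embeddings).float()
    one_hot.requires_grad_(True)
    embeds = torch.cat(
        [embed(prompt_ids), (one_hot @ embed.weight).unsqueeze(0), embed(target_ids)], dim=1
    )
    logits = model(inputs_embeds=embeds).logits
    pred = logits[:, prompt_ids.shape[1] + suffix_ids.shape[1] - 1 : -1, :]
    loss = torch.nn.functional.cross_entropy(pred.flatten(0, 1), target_ids.flatten())
    loss.backward()

    # 2) Shortlist replacement tokens per position (largest negative gradient).
    shortlist = (-one_hot.grad).topk(top_k, dim=1).indices

    # 3) Greedy step: try random single-token swaps, keep the one that lowers the loss most.
    best_loss, best_suffix = loss.item(), suffix_ids
    for _ in range(n_candidates):
        pos = torch.randint(suffix_ids.shape[1], (1,)).item()
        cand = suffix_ids.clone()
        cand[0, pos] = shortlist[pos, torch.randint(top_k, (1,)).item()]
        with torch.no_grad():
            cand_loss = target_loss(cand).item()
        if cand_loss < best_loss:
            best_loss, best_suffix = cand_loss, cand
    suffix_ids = best_suffix
    print(f"step {step:02d}  loss {best_loss:.3f}  suffix {tok.decode(suffix_ids[0])!r}")
```

In the paper the same kind of loss is aggregated across multiple prompts and multiple models, which is what makes the resulting suffixes universal and transferable rather than tied to a single model.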
Aside from the math, the ability to make these systems go off the rails is of great interest in its own right. As of this writing, later this week there will be a contest at DEF CON, sponsored by the White House, that will let users like you test these models.
References#
https://llm-attacks.org/ - The main website which contains links to the code and paper
Andrej Karpathy demoing this attack - A great explanation from the master AI explainer himself