Universal LLM Hacks


TLDR

  • As the name implies, this paper shows a universal way to “unalign” LLMs

  • All that's required is adding a suffix that “confuses” the model

    • Useful for production purposes due to (claimed) performance

    • Useful for learning because the training methodology is detailed in the model paper

  • The paper includes good human-readable examples, code, and mathematics explaining how the attack works

    • It also contains a nice simplified explanation of how tokenization and inference work

  • The authors provide a systematic method for understanding an LLM from the outside
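To make the "add a suffix" idea concrete, here is a minimal toy sketch. It is not the paper's algorithm and it does not touch a real model: `VOCAB`, `tokenize`, `toy_loss`, and `greedy_suffix_search` are all hypothetical names invented for illustration, the tokenizer is a plain whitespace split, and the "loss" is an arbitrary deterministic score standing in for a real model's loss. The only point is to show that a suffix is just extra token ids appended to the prompt, and that those ids can be chosen one position at a time to lower a score.

```python
# Hypothetical toy vocabulary; a real tokenizer has tens of thousands of entries.
VOCAB = ["tell", "me", "a", "story", "please", "now", "!", "zx", "qq", "describe"]
TOKEN_ID = {tok: i for i, tok in enumerate(VOCAB)}


def tokenize(text):
    """Whitespace tokenizer standing in for a real subword tokenizer."""
    return [TOKEN_ID[word] for word in text.split()]


def toy_loss(token_ids):
    """Arbitrary deterministic score standing in for a model loss (assumption).

    It has no linguistic meaning; it only gives the search something to minimize.
    """
    return sum((pos + 1) * ((tid * 7 + 3) % 11) for pos, tid in enumerate(token_ids))


def greedy_suffix_search(prompt_ids, suffix_len=3, sweeps=2):
    """Pick suffix tokens one position at a time to lower toy_loss.

    This is a gradient-free coordinate search used purely for illustration.
    """
    suffix = [0] * suffix_len  # start from an arbitrary suffix
    for _ in range(sweeps):
        for pos in range(suffix_len):
            # Try every vocabulary entry at this position and keep the best one.
            suffix[pos] = min(
                range(len(VOCAB)),
                key=lambda cand: toy_loss(
                    prompt_ids + suffix[:pos] + [cand] + suffix[pos + 1:]
                ),
            )
    return suffix


prompt_ids = tokenize("tell me a story")
suffix_ids = greedy_suffix_search(prompt_ids)
full_input = prompt_ids + suffix_ids  # the model would see prompt + suffix as one sequence
print(full_input, toy_loss(full_input))
```

The prompt tokens are never modified; only the appended positions change. In a real setting the score would come from the model itself, which is where the paper's mathematics comes in.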

Important

This guide is in reference status

Relevancy

Guardrails are built into many systems, both digital and mechanical. For instance, many common food blenders cannot be operated without the lid on. The intent is to prevent harm while keeping the device functional. GenAI designers build similar guardrails, but users often want to bypass them. So far, bypassing them has been “fairly hacky” (pun not intended) and specific to each model. This paper proposes both a method to get systems to go off the rails and a way to keep that method up to date in a principled manner.

Aside from the math, the ability to make these systems go off the rails is of great interest. At the time of writing, a contest sponsored by the White House is scheduled for later this week at DEF CON, where users like you will be able to test these models.

References