@lechmazur built an elimination game benchmark that's like LLMs playing Survivor. Models talk to each other, form alliances, and vote each other out.
The logs are a treasure trove of insights into how the models would game the system if told to survive. I built this visualization to explore the conversations and learn.
I see twelve kinds of scary behavior. Here is an example of each. (Click on a quote to see its context.)