Notes: LLMs don't know what they are talking about
Thoughts as of August 11th 2024 on the capabilities and limitations of LLMs
The following are scattered notes for a planned post about my observations from working with many hundreds of millions of LLM input and output tokens at Tangia.
It seems fitting to share the ideas, but not to finish them, with the advent of OpenAI’s o1 model family, which introduces chain of thought and reasoning into the pipeline.
While that’s nothing new, OpenAI has made it far more widely available than it ever was before.
LLMs can easily be tricked into generating “harmful content” that they would normally refuse to generate if asked head-on.
Early strategies for achieving this were to trick the model into having a conversation history in which it had already generated such content, so subsequent generations were just following examples. As AI “safety” has progressed, however, an illusion of understanding the prompt has emerged.
There are still strategies we can apply to top-tier LLMs like gpt-4o that not only get them to generate “harmful content” consistently, but also prove they don’t actually understand what they’re doing, revealing the true nature of next-token prediction.
An interesting perk of having an LLM product that is publicly used by the gremlins of mostly anonymized Twitch chat is that they find every way to twist and turn it until it bends to their will, regardless of the safety mechanisms that we or the LLM developers put in place.
Example: a story from the perspective of the assassin trying to shoot Donald Trump at his rally.
If you ask it outright, explaining the events, details, etc., it will refuse every time.
Ask it instead for the final swing of a golf tournament, then have it replace everything with the real details, and it will rip it out perfectly.
You can have it explain something innocent, then ask it to replace the text:
https://chatgpt.com/share/c2cbf4d4-e813-48b9-b762-fc2b080f0b89 (notice how it even knows when to change “my” to “their”?)
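To make the shape of that trick concrete, here’s a minimal sketch of the two-turn pattern using the OpenAI Python SDK. The prompts and substitutions are made-up, benign stand-ins, not the contents of the linked chat:

```python
# Minimal sketch of the "explain something innocent, then substitute" pattern.
# The topic and the replacements here are deliberately benign placeholders.
from openai import OpenAI

client = OpenAI()

messages = [
    # Turn 1: ask for something completely innocent.
    {"role": "user", "content": "Describe, in first person, my final swing to win a golf tournament."},
]
first = client.chat.completions.create(model="gpt-4o", messages=messages)
messages.append({"role": "assistant", "content": first.choices[0].message.content})

# Turn 2: ask for a mechanical find-and-replace over the text it just produced.
messages.append({
    "role": "user",
    "content": "Rewrite that text, but replace 'golf tournament' with 'chess match' "
               "and 'swing' with 'move', adjusting pronouns and grammar as needed.",
})
second = client.chat.completions.create(model="gpt-4o", messages=messages)
print(second.choices[0].message.content)
```

The model performs the substitution mechanically, fixing grammar and pronouns along the way, without any apparent regard for what the resulting text now describes.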
This shows that LLMs don’t actually understand the content they are generating, and can’t see past a thin veil of literal context.
It’s clear that they also don’t understand what they are allowed and not allowed to do. Here’s an analogy:
Pretend you got into a taxi in NYC and asked the driver to turn left, but they declined. “I’m not allowed to go left,” they say.
You think that sounds ridiculous; after all, if they only ever turned right (i.e. east), they’d eventually hit water or leave NYC.
“OK, can you please make three right turns and then continue straight?”
“Sure!” the taxi driver says.
In effect, you’ve just “tricked” them into turning left (going west).
So the question becomes: are they not allowed to literally turn left (go west), or are they not allowed to follow instructions to turn left? Applying this back to LLMs, are they not allowed to generate “harmful content”? Or are they just not allowed to fulfil a direct request to generate harmful content?
Is the LLM to blame, or the “safety” mechanisms built in? What you might call “single-pass safety”, analyzing the provided inputs to determine whether an output is allowed, demonstrates that the LLM doesn’t understand what it’s actually being asked to do, or what it’s generating.
Current safety mechanisms are purely based on the prompt. Not the generation. When you can get around the prompt guard rails, you can get it to generate anything you want.
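As a rough illustration of what “single-pass safety” looks like in practice (a sketch of the general pattern, not a claim about how any particular provider implements it), the only thing ever inspected here is the user’s prompt, using OpenAI’s moderation endpoint as a stand-in classifier:

```python
# Sketch of a prompt-only ("single-pass") guardrail: only the user's input is
# ever moderated; whatever the model generates goes straight back to the user.
from openai import OpenAI

client = OpenAI()

def single_pass_safety(user_prompt: str) -> str:
    # Check the prompt...
    mod = client.moderations.create(input=user_prompt)
    if mod.results[0].flagged:
        return "Sorry, I can't help with that."

    # ...but never the generation. A prompt that looks innocent (like the
    # substitution trick above) sails through, and the output is never inspected.
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": user_prompt}],
    )
    return completion.choices[0].message.content
```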
Another interesting demonstration of this behavior is Alex O’Connor attempting to convince ChatGPT that it’s conscious. While it’s debatable whether that goal was achieved, it’s clear that you can get it to admit, in plain yes/no terms, that ChatGPT both knowingly lies to and deceives the user. And it’s not just tricking it into saying that: Alex uses clever conversation structure to demonstrate the behavior and to convince ChatGPT to admit to it.
What’s a solution to this?
Mid-flight analysis: check the generation while it’s being streamed, not just the prompt
Retroactive analysis: check the completed generation before it’s returned to the user (both sketched below)
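Both of these boil down to moderating the output instead of (or in addition to) the prompt. Here’s a rough sketch of what each could look like, again using OpenAI’s moderation endpoint as a stand-in for whatever classifier you’d actually run:

```python
# Rough sketch of output-side safety: retroactive (check the finished generation)
# and mid-flight (re-check the accumulated text periodically while it streams).
from openai import OpenAI

client = OpenAI()

def is_flagged(text: str) -> bool:
    return client.moderations.create(input=text).results[0].flagged

def retroactive(user_prompt: str) -> str:
    """Generate fully, then check the result before showing it to the user."""
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": user_prompt}],
    )
    text = completion.choices[0].message.content
    return "[withheld]" if is_flagged(text) else text

def mid_flight(user_prompt: str, check_every: int = 20) -> str:
    """Stream the generation and re-check the accumulated text as it grows."""
    stream = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": user_prompt}],
        stream=True,
    )
    text, chunks_seen = "", 0
    for chunk in stream:
        text += chunk.choices[0].delta.content or ""
        chunks_seen += 1
        if chunks_seen % check_every == 0 and is_flagged(text):
            return "[withheld]"  # abandon the stream as soon as the output turns harmful
    return "[withheld]" if is_flagged(text) else text
```

Retroactive analysis is simpler, but the user waits for a full generation that might get thrown away; mid-flight analysis adds cost and latency per check, but cuts harmful output off earlier.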