INDEX
Explanations
phrases about deception and the ability to evade consequences
Negative Logits
vat       -0.16
atti      -0.14
coni      -0.14
oucher    -0.14
errer     -0.13
onne      -0.13
Fam       -0.13
bih       -0.13
arin      -0.13
quo       -0.13
Positive Logits
impunity   0.32
away       0.26
escape     0.24
Away       0.21
away       0.21
getaway    0.21
escaping   0.21
escape     0.20
escapes    0.20
immunity   0.20
Activations Density 0.134%
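
A density figure like the one above is usually read as the fraction of token positions on which the feature fires. The sketch below illustrates that reading only; the function name `activation_density`, the zero threshold, and the sample data are all assumptions, not the dashboard's actual computation.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of positions whose activation exceeds `threshold`.

    Assumption: a feature counts as "active" when its activation is
    strictly greater than the threshold (0.0 by default).
    """
    if not activations:
        return 0.0
    fired = sum(1 for a in activations if a > threshold)
    return fired / len(activations)


# Hypothetical data: 10,000 token positions, 13 of which activate.
acts = [0.0] * 10_000
for i in range(13):
    acts[i * 700] = 1.5

print(f"{activation_density(acts):.3%}")
```

A density this low means the feature is sparse: it is silent on the vast majority of tokens and fires only on a narrow set of contexts.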