INDEX
Explanations
affirmations and expressions of agreement
Negative Logits
ocu           -0.16
dök           -0.15
Hey           -0.14
acht          -0.14
alian         -0.14
PERMISSION    -0.13
Uns           -0.13
fet           -0.13
atra          -0.13
mitt          -0.13
Positive Logits
wrong         0.75
correct       0.75
wrong         0.64
Wrong         0.60
WRONG         0.57
Wrong         0.57
correct       0.57
incorrect     0.56
Correct       0.53
Correct       0.52
Activations Density 0.226%