Explanations
references to concern or abnormality in situations
Negative Logits
mps         -0.16
ramer       -0.15
fait        -0.15
éŀ          -0.14
Wyatt       -0.14
èįĴ         -0.14
ÙĪÙĦÙĪØ¬    -0.13
khó         -0.13
oyal        -0.13
çī¹èī²      -0.13
Positive Logits
wrong       0.40
wrong       0.35
Wrong       0.33
Wrong       0.29
WRONG       0.29
_wrong      0.24
fish        0.23
fish        0.19
Fish        0.19
wrongful    0.18
Activations Density: 0.060%