Explanations
phrases that indicate falsehood or misinformation
Negative Logits
Token          Logit
ito            -0.17
hypoc          -0.16
anders         -0.14
pun            -0.14
ylon           -0.14
Riv            -0.14
rick           -0.14
illiseconds    -0.14
lovers         -0.13
empo           -0.13
Positive Logits
Token          Logit
accuracy       0.19
accuracy       0.19
ibold          0.18
accurate       0.18
Accuracy       0.17
accur          0.17
accur          0.17
وغ             0.15
Accuracy       0.15
reality        0.15
Activations Density: 0.206%
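
The density figure is usually read as the fraction of sampled token positions on which the feature fires at all. The sketch below shows one way such a number could be computed. It is an illustration under that assumption, not the dashboard's actual method: the function name `activation_density`, the zero threshold, and the sample data are all hypothetical.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds threshold.

    Assumption: a position counts as "active" when its activation is strictly
    greater than threshold (0.0 by default). The source does not say which
    definition the dashboard uses.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: 100,000 token positions, 206 of which are active.
sample = [0.0] * 99_794 + [1.3] * 206
density = activation_density(sample)
print(f"{density:.3%}")
```

With this made-up sample the result matches the 0.206% shown above, meaning roughly one active position in every 485 tokens.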