INDEX
Explanations
instances of the word "false"
phrases related to false claims or deception
Negative Logits
hens                -1.00
hetti               -0.94
ODY                 -0.84
imen                -0.82
scene               -0.80
mun                 -0.79
guiActiveUnfocused  -0.79
APTER               -0.76
arya                -0.74
ILE                 -0.73

Positive Logits
accuser             0.94
false               0.93
guiActiveUn         0.92
unfocusedRange      0.88
positives           0.87
misrepresent        0.84
dich                0.83
falsely             0.80
guiIcon             0.78
false               0.77
Activations Density: 0.014%