INDEX
Explanations
explicit mentions of the word "false"
references to falsehoods or misleading claims
Negative Logits
hens                 -1.00
guiActiveUnfocused   -0.97
hetti                -0.81
asio                 -0.77
ajo                  -0.76
arya                 -0.75
mun                  -0.75
rike                 -0.73
forces               -0.72
aldo                 -0.71
Positive Logits
positives            1.01
accuser              0.89
guiActiveUn          0.86
dich                 0.85
guiIcon              0.79
false                0.76
negatives            0.76
falsely              0.76
accusation           0.74
imprisonment         0.74
Activations Density 0.019%
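
A density figure like the one above is commonly read as the fraction of sampled token positions on which the feature fires. The sketch below illustrates that reading only; the function name `activation_density`, the `threshold` parameter, and the sample data are hypothetical, and the dashboard's actual computation is not stated in this page.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds threshold.

    Assumption: "density" means (number of firing positions) / (total positions).
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    firing = sum(1 for a in activations if a > threshold)
    return firing / len(activations)


# Hypothetical sample: 10,000 token positions, feature active on 2 of them.
sample = [0.0] * 9998 + [1.3, 0.4]
density = activation_density(sample)
print(f"{density:.3%}")
```

Under this reading, a density of 0.019% would correspond to the feature firing on roughly 19 out of every 100,000 token positions.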