Explanations
instances of the term "false" in various contexts related to misconceptions or misinformation
Negative Logits
Token         Logit
adiens        -0.17
shal          -0.16
amar          -0.16
ActionTypes   -0.16
lp            -0.15
OfClass       -0.15
ram           -0.15
shire         -0.15
zd            -0.15
istles        -0.14
Positive Logits
Token         Logit
hood           0.27
positives      0.23
-flag          0.21
alarms         0.21
-positive      0.20
alarm          0.19
fully          0.19
/false         0.18
claim          0.17
ivec           0.17
Activations Density: 0.032%