INDEX
Explanations
the presence of negation or arguments against common beliefs
Negative Logits
Token        Logit
eres         -0.15
gnore        -0.15
revert       -0.14
ucker        -0.14
aces         -0.14
prov         -0.14
стор         -0.14
-assets      -0.14
OLL          -0.14
Foreground   -0.14
Positive Logits
Token        Logit
arella       0.16
ulado        0.15
Slash        0.15
enuity       0.15
alth         0.15
iasi         0.14
adel         0.14
Mand         0.14
imity        0.13
chine        0.13
Activations Density 0.091%
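
One way to read the density figure, assuming it denotes the fraction of sampled token positions on which the feature has a nonzero activation (the source does not define it), is sketched below. The function name `activation_density`, the `threshold` parameter, and the sample data are hypothetical and chosen only to illustrate the arithmetic.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of positions whose activation exceeds `threshold`.

    Assumption: "density" means the share of token positions where the
    feature fires at all; the dashboard itself does not state its definition.
    """
    if not activations:
        raise ValueError("need at least one activation value")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: 100,000 token positions, 91 of which activate the feature.
sample = [0.0] * 99_909 + [1.3] * 91
density = activation_density(sample)
print(f"{density:.3%}")  # prints 0.091%
```

Under this reading, a density of 0.091% corresponds to the feature firing on roughly 1 in 1,100 token positions.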