Explanations
negations and refusals in the text
NEGATIVE LOGITS
Token     Logit
nty       -0.15
ãĤ¸ãĤ¢    -0.15
ntag      -0.14
omy       -0.14
afort     -0.13
OVE       -0.13
اÛĮØ´     -0.13
awns      -0.12
='".      -0.12
롯        -0.12
POSITIVE LOGITS
Token         Logit
necessarily   0.41
mind          0.34
ever          0.32
even          0.31
exactly       0.29
dare          0.27
necessary     0.26
EVER          0.25
bother        0.25
anymore       0.25
Activations Density: 0.170%