Explanations
negations or phrases indicating refusal
Negative Logits
unden          -0.16
æ¨             -0.15
inand          -0.14
forgettable    -0.14
incr           -0.14
chner          -0.14
Ñģаме          -0.13
,eg            -0.13
lain           -0.13
ibri           -0.13
Positive Logits
sure           0.32
sure           0.30
nearly         0.24
Sure           0.23
Sure           0.23
alone          0.21
anymore        0.21
necessarily    0.20
Nearly         0.19
alone          0.19
Activations Density 0.127%
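
The density figure can be read as the share of sampled tokens on which the feature fires at all. The sketch below is a minimal illustration under that assumption. The function name `activation_density`, the `threshold` parameter, and the sample values are hypothetical and do not come from the dashboard itself.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of tokens whose activation exceeds `threshold`.

    Assumption: "density" means the share of tokens with a nonzero
    (above-threshold) activation. The dashboard may define it differently.
    """
    if not activations:
        raise ValueError("need at least one activation value")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical per-token activations for one feature over eight tokens.
sample = [0.0, 0.0, 0.0, 2.4, 0.0, 0.0, 0.0, 0.0]
print(f"{activation_density(sample):.3%}")
```

Under this reading, a density of 0.127% corresponds to the feature firing on roughly 127 of every 100,000 tokens.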