INDEX
Explanations
questions expressing disbelief or incredulity
Negative Logits
eniable    -0.16
nio        -0.15
Apprec     -0.15
111        -0.15
Locator    -0.14
enu        -0.14
emonic     -0.14
бе         -0.14
oyal       -0.14
itters     -0.14
Positive Logits
else       0.23
do         0.20
aya        0.19
timing     0.18
ELSE       0.17
say        0.17
else       0.16
else       0.16
harm       0.16
Else       0.16
Activations Density: 0.070%