INDEX
Explanations
negative statements regarding capabilities and evidence
Negative Logits
contri      -0.15
abr         -0.15
çķĻ         -0.15
stice       -0.14
aby         -0.14
ovie        -0.14
à¹ģหล       -0.14
annon       -0.14
umpt        -0.13
æĽ          -0.13
Positive Logits
even        0.26
even        0.25
anywhere    0.24
slightest   0.20
Anywhere    0.19
much        0.19
any         0.19
bother      0.19
meaningful  0.19
really      0.19
Activations Density 0.060%