INDEX
Explanations
phrases indicating certainty or strong opinions
Negative Logits
NOTHING     -0.20
NONE        -0.19
NONE        -0.17
ENTE        -0.17
anything    -0.17
anything    -0.16
nothing     -0.15
none        -0.15
олов        -0.15
icont       -0.15
Positive Logits
cach        0.17
absolutely  0.17
elas        0.15
imos        0.14
bot         0.14
647         0.14
ergus       0.14
eler        0.14
bott        0.13
452         0.13
Activations Density: 0.190%