Explanations
phrases expressing denial or rejection
Negative Logits
Ú¯ÛĮ      -0.16
еÑģÑı     -0.15
hod       -0.15
arat      -0.15
iban      -0.14
fact      -0.14
enuine    -0.14
айд       -0.14
ós        -0.14
ador      -0.14
Positive Logits
understand  0.17
deserve     0.17
rightly     0.15
know        0.15
belong      0.15
care        0.14
itsu        0.14
suppose     0.14
cha         0.14
176         0.14
Activations Density 0.039%