INDEX
Explanations
negative emotional descriptors and expressions of suffering
Negative Logits
ÌĢ      -0.16
tou     -0.16
о       -0.15
ÙIJÙĥ    -0.15
iesta   -0.14
а       -0.14
hete    -0.14
æ       -0.13
quali   -0.13
tic     -0.13
Positive Logits
ÙĪ      0.24
ØĮ      0.22
Ú©      0.21
ب       0.21
بر      0.20
ÙIJ      0.20
با      0.20
س       0.20
ÙħÙĨ    0.20
âĢĮ     0.20
Activations Density 0.006%
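
Several tokens in the logit lists above (for example ÙĪ, ÙħÙĨ and âĢĮ) look like the display form used by GPT-2-style byte-level BPE vocabularies, where each raw byte is shown as a stand-in printable character. The page does not say which tokenizer produced them, so that is an assumption here. Under that assumption, the following minimal sketch maps such strings back to readable text. The helper name `decode_token` is hypothetical and not part of any dashboard API.

```python
def bytes_to_unicode() -> dict[int, str]:
    """Byte -> display character table, as in GPT-2-style byte-level BPE.

    Assumption: the tokens on this page use this table. Printable bytes
    map to themselves; all other bytes map to code points from U+0100 up.
    """
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    table = {b: chr(b) for b in printable}
    shift = 0
    for b in range(256):
        if b not in table:
            table[b] = chr(256 + shift)
            shift += 1
    return table


# Inverse table: display character -> raw byte value.
CHAR_TO_BYTE = {ch: b for b, ch in bytes_to_unicode().items()}


def decode_token(token: str) -> str:
    """Return the UTF-8 text behind a byte-level token string.

    Tokens that are already readable (characters outside the table) or
    that do not form valid UTF-8 on their own are returned unchanged.
    """
    try:
        raw = bytes(CHAR_TO_BYTE[ch] for ch in token)
        return raw.decode("utf-8")
    except (KeyError, UnicodeDecodeError):
        return token


if __name__ == "__main__":
    for tok in ["ÙĪ", "ÙħÙĨ", "âĢĮ", "ب", "tou"]:
        decoded = decode_token(tok)
        # Print code points so invisible characters remain inspectable.
        print(tok, "->", [f"U+{ord(c):04X}" for c in decoded])
```

If the assumption holds, the strongest positive logits decode to Arabic-script characters (for instance, ÙĪ becomes U+0648 and âĢĮ becomes the zero-width non-joiner U+200C), while tokens such as ب and tou pass through unchanged.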