INDEX
Explanations
expressions of concern and apathy towards others and their feelings
Negative Logits
anner            -0.19
کاری             -0.15
ersen            -0.15
arest            -0.14
ova              -0.14
ServiceProvider  -0.14
itude            -0.14
ader             -0.14
IDDLE            -0.14
reu              -0.14
Positive Logits
mue               0.16
ós                0.16
endir             0.15
whether           0.14
.chk              0.14
fur               0.14
agus              0.13
indir             0.13
ingly             0.13
죽                0.13
Activation Density: 0.040%