Explanations
expressions related to participation and involvement in various activities
Negative Logits
pornos      -0.10
∏           -0.09
abwe        -0.09
огра        -0.09
ůl          -0.09
галі        -0.09
_contrib    -0.09
erus        -0.09
vise        -0.08
erva        -0.08
Positive Logits
id          0.07
oneself     0.06
and         0.06
ment        0.06
SD          0.06
1           0.06
aling       0.06
f           0.06
illing      0.06
itude       0.06
Activations Density 0.127%