Explanations
expressions of friendliness and supportive social interactions
Negative Logits
ors       -0.20
CCA       -0.16
ÑĨÑİ      -0.15
åĸľ       -0.15
iled      -0.14
sf        -0.14
576       -0.14
ساÙĨ      -0.14
uiltin    -0.14
odzi      -0.14
Positive Logits
lier      0.24
liness    0.21
confines  0.20
liest     0.19
lies      0.18
ships     0.17
disposed  0.17
faces     0.17
ness      0.17
enough    0.17
Activations Density: 0.023%