Explanations
references to friendliness or positive social interactions
Negative Logits
Token          Logit
ors            -0.20
iled           -0.15
CCA            -0.15
ÑĨÑİ           -0.15
Dit            -0.15
eday           -0.15
veis           -0.15
sr             -0.14
875            -0.14
sf             -0.14
Positive Logits
Token          Logit
lier           0.25
liest          0.20
ships          0.18
liness         0.18
disposed       0.17
neighborhood   0.17
confines       0.17
acht           0.17
neighbourhood  0.16
Fam            0.16
Activations Density: 0.015%