Explanations
references to authority figures or leadership roles
Negative Logits
hed       -0.18
éru       -0.16
adh       -0.16
ร         -0.15
ØŃت       -0.14
andum     -0.14
795       -0.14
vast      -0.14
adf       -0.14
ンディ    -0.14
Positive Logits
anova     0.24
(es       0.22
ial       0.20
-worker   0.19
/exec     0.19
娘        0.19
eldorf    0.19
iale      0.18
dom       0.18
iali      0.17
Activations Density 0.028%