INDEX
Explanations
references to authority figures, specifically in socio-political contexts
Negative Logits
Z         -0.17
Arts      -0.15
30        -0.15
381       -0.15
pies      -0.14
-n        -0.14
colon     -0.14
ertest    -0.14
-Z        -0.14
erea      -0.14
Positive Logits
lems      0.17
oS        0.15
ec        0.14
ùy        0.14
.utf      0.14
ека       0.14
ведиÑĤе   0.14
Phoenix   0.14
Rak       0.13
SI        0.13
Activations Density 0.038%