Explanations
mentions of political affiliations and governmental changes
Negative Logits
Token       Logit
rám         -0.15
arges       -0.15
arge        -0.15
\Modules    -0.14
oux         -0.14
ARGE        -0.14
onden       -0.14
ãĥ¼ãĥľ      -0.13
ofire       -0.13
ample       -0.13
Positive Logits
Token       Logit
allegiance  0.42
loyalty     0.40
loyal       0.36
loy         0.34
Loy         0.33
alignment   0.32
align       0.32
switch      0.30
switching   0.30
alleg       0.30
Activations Density: 0.264%