INDEX
Explanations
references to social inequalities and power dynamics
Negative Logits
apore          -0.08
alÄ±ÅŁ         -0.07
mercenaries    -0.06
ylvania        -0.06
isk            -0.06
_IA            -0.06
onde           -0.06
ANNER          -0.06
Margins        -0.06
prm            -0.06
Positive Logits
powerful       0.15
VIP            0.14
senior         0.14
politicians    0.13
high           0.12
highest        0.12
top            0.12
Powerful       0.12
important      0.12
higher         0.12
Activations Density 0.050%