INDEX
Explanations
references to political ideologies, particularly those associated with the left and right
Negative Logits
jee        -0.17
actal      -0.17
uled       -0.16
fisse      -0.15
.inflate   -0.15
ossal      -0.15
jure       -0.15
ipers      -0.15
ration     -0.15
anse       -0.15
Positive Logits
wing       0.41
-wing      0.40
wing       0.39
Wing       0.33
ward       0.32
翼         0.29
ist        0.29
ists       0.28
wings      0.27
-leaning   0.25
Activations Density: 0.009%