INDEX
Explanations
references to leftist ideologies and movements
Negative Logits
Token     Logit
ously     -0.18
egin      -0.17
rna       -0.17
ptions    -0.16
erate     -0.15
aeper     -0.15
rung      -0.15
risk      -0.14
edin      -0.14
itag      -0.14
Positive Logits
Token     Logit
ward      0.25
wards     0.22
-hand     0.22
ness      0.20
/right    0.20
most      0.18
ened      0.18
ت         0.18
-wing     0.17
eous      0.17
Activations Density: 0.034%