Explanations
words related to social or political causes
references to causes related to various issues and events
Negative Logits
aeper      -0.81
Ku         -0.72
Seym       -0.70
PDATE      -0.69
illet      -0.69
Leopard    -0.68
olitan     -0.68
aturdays   -0.67
lav        -0.66
ilings     -0.63
Positive Logits
cele       1.32
cause      0.87
way        0.78
forge      0.74
wagon      0.71
facts      0.70
vier       0.70
celeb      0.70
ality      0.70
fare       0.70
Activations Density 0.030%