Explanations
names related to politics
Negative Logits
istically    -0.78
Constructed  -0.78
ivity        -0.72
akedown      -0.71
uing         -0.68
RIS          -0.66
uality       -0.66
uously       -0.65
ively        -0.64
ariat        -0.64
Positive Logits
byss         1.34
tto          0.99
tti          0.96
nell         0.92
pedia        0.89
stein        0.88
cki          0.88
Verb         0.88
hound        0.86
mand         0.83
Activations Density: 0.020%