INDEX
Explanations
names or terms related to geographical locations or political figures
proper nouns, specifically names related to individuals or places
Negative Logits
Token        Logit
uer          -0.81
ERAL         -0.80
uers         -0.80
otine        -0.75
ossom        -0.73
AME          -0.73
eral         -0.72
Niet         -0.71
erest        -0.71
ervatives    -0.69
Positive Logits
Token        Logit
Khan         0.87
onboard      0.71
calibr       0.68
hypers       0.67
padd         0.67
simulations  0.66
simulator    0.66
calibrated   0.66
jaw          0.65
cockpit      0.65
Activations Density: 0.001%