INDEX
Explanations
names of political figures and terms related to them
proper nouns, particularly names and locations
Negative Logits
istically    -0.74
士           -0.74
åĬ           -0.73
ashtra       -0.72
Hastings     -0.71
icity        -0.70
EMBER        -0.70
RAW          -0.69
utherford    -0.69
OPLE         -0.69
Positive Logits
kens         0.94
wana         0.77
bles         0.76
kered        0.75
wash         0.74
bled         0.74
yip          0.74
virt         0.71
pload        0.71
bah          0.71
Activations Density 0.031%