INDEX
Explanations
references to nobility or aristocracy
Negative Logits
acen     -0.18
cı       -0.18
Fritz    -0.17
icked    -0.16
PIX      -0.15
eldon    -0.15
agh      -0.15
quin     -0.15
abaj     -0.15
ocab     -0.14
Positive Logits
les      0.32
lemen    0.30
ility    0.29
odies    0.26
LES      0.23
iliary   0.22
ilities  0.21
bery     0.19
ilis     0.19
bled     0.18
Activations Density 0.005%