Explanations
instances of the word "one"
Negative Logits
rosse    -0.08
ffi      -0.07
inois    -0.07
aney     -0.07
roid     -0.06
IBC      -0.06
anca     -0.06
chten    -0.06
ruk      -0.06
pery     -0.06
Positive Logits
of        0.08
cle       0.07
among     0.06
ixture    0.06
Kew       0.06
woke      0.06
Cad       0.06
among     0.05
Zu        0.05
favorite  0.05
Activations Density: 0.013%