INDEX
Explanations
the letter "B" in various contexts
Negative Logits
uddy    -0.22
ottle   -0.21
onds    -0.21
ROKE    -0.19
ike     -0.18
á»Ļ     -0.18
unny    -0.18
REW     -0.17
ihar    -0.17
roke    -0.16
Positive Logits
erts    0.19
amber   0.18
orch    0.18
ick     0.17
zd      0.17
hatt    0.17
hat     0.17
ens     0.16
idel    0.16
ly      0.16
Activations Density 0.037%