INDEX
Explanations
the letter 'w' in various contexts within the text
Negative Logits
yg      -0.20
r       -0.19
unw     -0.17
rav     -0.17
ÙĦ      -0.17
il      -0.17
ر       -0.17
y       -0.16
rang    -0.16
ys      -0.16
POSITIVE LOGITS
ester    0.23
arden    0.21
ondrous  0.21
alled    0.20
iser     0.20
ares     0.20
avy      0.20
alter    0.20
iley     0.19
asser    0.18
Activations Density: 0.015%