INDEX
Explanations
occurrences of the token "-st" across various contexts
Negative Logits
Token   Logit
rw      -0.25
rut     -0.23
rh      -0.20
rane    -0.20
ré      -0.20
rar     -0.20
r       -0.20
rin     -0.20
rig     -0.19
rax     -0.19
Positive Logits
Token    Logit
udio     0.35
rength   0.34
atement  0.33
roke     0.32
reet     0.32
udy      0.31
rike     0.31
arter    0.30
ories    0.30
ripe     0.30
Activations Density: 0.015%