Explanations
occurrences of the word "og"
Negative Logits
Token   Logit
y       -0.24
o       -0.19
g       -0.18
yb      -0.17
nell    -0.16
nelle   -0.15
yne     -0.15
sip     -0.15
eton    -0.15
s       -0.15
Positive Logits
Token   Logit
ues     0.30
lio     0.26
ei      0.25
ging    0.25
gers    0.24
eo      0.24
gy      0.23
ey      0.23
lu      0.22
getto   0.21
Activations Density: 0.023%