INDEX
Explanations
the presence of the word "op" in various forms
Negative Logits
y      -0.38
h      -0.27
t      -0.27
s      -0.26
er     -0.26
ic     -0.25
e      -0.25
yb     -0.24
eurs   -0.20
yar    -0.20
Positive Logits
posite      0.22
edia        0.22
ausal       0.22
portunity   0.21
lastic      0.19
las         0.19
ercul       0.18
inion       0.18
pi          0.18
corn        0.18
Activations Density: 0.020%