INDEX
Explanations
phrases indicating obviousness or clarity
Negative Logits
Token       Logit
nown        -0.18
elon        -0.17
elman       -0.16
ocked       -0.16
plemented   -0.16
絶          -0.15
hape        -0.14
ropoda      -0.14
pty         -0.14
————————    -0.14
Positive Logits
Token       Logit
unction     0.17
517         0.15
afi         0.14
!=(         0.14
ơn          0.14
afd         0.14
Gunn        0.13
enus        0.13
ours        0.13
uro         0.13
Activations Density 0.058%