INDEX
Explanations
questions starting with "why."
NEGATIVE LOGITS
ffa       -0.16
onse      -0.15
amework   -0.15
leine     -0.14
sters     -0.14
UNET      -0.14
ners      -0.14
ivate     -0.14
aurus     -0.14
wins      -0.13
POSITIVE LOGITS
ever      0.23
alla      0.19
do        0.19
did       0.18
does      0.17
te        0.17
bother    0.16
else      0.16
waste     0.16
not       0.16
Activations Density 0.023%