INDEX
Explanations
queries framed with the word "what."
Negative Logits
ummer       -0.15
thuận       -0.15
pad         -0.15
oubted      -0.15
anything    -0.15
多少        -0.15
CC          -0.14
FFE         -0.14
suite       -0.14
ongs        -0.14
Positive Logits
about       0.17
do          0.17
if          0.16
effect      0.15
exactly     0.15
wenn        0.14
ird         0.14
choice      0.14
Harden      0.14
isine       0.14
Activations Density 0.045%