INDEX
Explanations
phrases indicating existence or presence
Negative Logits
little     -0.16
cri        -0.15
nothing    -0.14
atrix      -0.14
Gould      -0.14
arris      -0.14
much       -0.14
uant       -0.14
anton      -0.13
ovenant    -0.13
Positive Logits
elve       0.19
jich       0.17
素         0.17
deaux      0.15
_FT        0.15
uni        0.15
ITTER      0.15
fewer      0.15
inas       0.14
ších       0.14
Activations Density: 0.034%