INDEX
Explanations
phrases indicating outcomes or conclusions
Negative Logits
Token       Logit
oria        -0.16
onis        -0.16
tery        -0.16
/english    -0.15
vertiser    -0.15
etten       -0.15
duk         -0.15
.ejb        -0.14
orex        -0.14
еку         -0.14
Positive Logits
Token       Logit
boil        0.30
boils       0.28
boiled      0.25
Bo          0.23
boiling     0.22
down        0.21
_bo         0.19
boz         0.19
bo          0.19
Bo          0.18
Activations Density 0.099%