Explanations
various forms of the prefix "dis" or words related to negative or undesirable outcomes
Negative Logits
hind        -0.18
iley        -0.17
ong         -0.15
kad         -0.15
scriber     -0.15
jet         -0.15
kap         -0.14
het         -0.14
uples       -0.14
azon        -0.14
Positive Logits
ellaneous   0.19
rael        0.19
naire       0.17
¼           0.16
gow         0.15
emean       0.15
ment        0.14
ettes       0.14
/dis        0.14
keit        0.14
Activations Density 0.083%