Explanations
words related to negation or reversal, often in the form of prefixes like "un-" or "anti-"
words related to undignified or unrefined behavior
Negative Logits
Token      Logit
anwhile    -0.80
ppo        -0.79
Ĥİ         -0.75
auga       -0.74
姫          -0.73
tsky       -0.73
azines     -0.72
bley       -0.72
zzo        -0.72
ramid      -0.69
Positive Logits
Token       Logit
etermined   1.09
oubt        1.04
iscovered   1.03
aunted      0.99
irect       0.96
ec          0.95
epend       0.93
ifferent    0.91
amed        0.88
etermin     0.87
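Lists like the two above are commonly produced by projecting the feature's decoder direction through the model's unembedding matrix and keeping the k largest and k smallest entries. The sketch below assumes that approach; the page does not state how its logits were computed or normalized, and the names `logit_effects`, `top_and_bottom`, and the d_model x vocab layout of `unembed` are hypothetical. It uses only the standard library and toy numbers.

```python
from typing import Sequence


def logit_effects(
    decoder_dir: Sequence[float],
    unembed: Sequence[Sequence[float]],
) -> list[float]:
    """Dot the feature's decoder direction with each token's unembedding column.

    Assumption: unembed is laid out as d_model rows x vocab_size columns.
    """
    vocab_size = len(unembed[0])
    return [
        sum(decoder_dir[i] * unembed[i][t] for i in range(len(decoder_dir)))
        for t in range(vocab_size)
    ]


def top_and_bottom(
    effects: Sequence[float], vocab: Sequence[str], k: int
) -> tuple[list[tuple[str, float]], list[tuple[str, float]]]:
    """Return (positive, negative): the k highest and k lowest tokens by effect."""
    order = sorted(range(len(effects)), key=lambda t: effects[t])
    negative = [(vocab[t], effects[t]) for t in order[:k]]
    # Highest first, matching the descending order of a "positive logits" list.
    positive = [(vocab[t], effects[t]) for t in reversed(order[-k:])]
    return positive, negative


# Toy example: d_model = 2, vocabulary of 4 placeholder tokens.
effects = logit_effects([1.0, 0.5], [[1.0, -1.0, 0.0, 0.5], [0.0, 2.0, -2.0, 0.5]])
positive, negative = top_and_bottom(effects, ["a", "b", "c", "d"], k=2)
print(positive)  # [('a', 1.0), ('d', 0.75)]
print(negative)  # [('c', -1.0), ('b', 0.0)]
```

Because the score is a plain dot product, a token lands in the positive list when its unembedding vector points the same way as the feature's write direction, which is why such lists often share a surface pattern rather than a single meaning.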
Activation density: 0.011%
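Activation density is usually read as the fraction of token positions on which the feature fires. At 0.011% (1.1e-4), that works out to roughly one token in every 9,100. The sketch below assumes "fires" means an activation strictly above a threshold of 0; the page does not say which threshold or sample corpus it used, and `activation_density` is a hypothetical helper.

```python
from typing import Iterable


def activation_density(activations: Iterable[float], threshold: float = 0.0) -> float:
    """Fraction of token positions whose activation is strictly above threshold.

    Assumption: threshold = 0.0, i.e. any positive activation counts as firing.
    """
    total = 0
    active = 0
    for a in activations:
        total += 1
        if a > threshold:
            active += 1
    if total == 0:
        raise ValueError("no activations supplied")
    return active / total


# 11 firing positions out of 100,000 gives 0.011%, the density reported above.
acts = [0.0] * 100_000
for i in range(11):
    acts[i * 9_000] = 1.0
density = activation_density(acts)
print(f"{density:.3%}")  # 0.011%
```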