Explanations
phrases indicating a negative stance or denial
Negative Logits
Token        Logit
towed        -0.78
crowned      -0.70
imir         -0.66
sped         -0.66
rex          -0.65
tossed       -0.65
flung        -0.64
stabilized   -0.63
knocked      -0.63
bombed       -0.62
Positive Logits
Token        Logit
xious        1.05
except       0.90
oses         0.88
obs          0.86
excuses      0.85
ct           0.85
doubt        0.84
discern      0.82
meaningful   0.82
THING        0.80
Activations Density: 0.050%