INDEX
Explanations
various phrases indicating alternatives or differing perspectives
Negative Logits
urated      -0.84
usters      -0.81
osponsors   -0.74
gur         -0.68
oru         -0.65
ardy        -0.64
andise      -0.62
ilings      -0.62
mented      -0.62
ĸļ          -0.61
Positive Logits
sucks       0.71
else        0.70
though      0.68
imaginable  0.66
depends     0.65
soever      0.64
,           0.62
however     0.61
wors        0.60
we          0.60
Activations Density: 0.008%