INDEX
Explanations
phrases expressing personal opinions or comparisons
comparisons or similes
Negative Logits
Token      Logit
oust       -0.83
uty        -0.78
uid        -0.74
inion      -0.73
itles      -0.73
olphin     -0.70
nerg       -0.70
OE         -0.70
ulic       -0.70
vantage    -0.69
Positive Logits
Token      Logit
lier       1.05
crap       1.02
something  0.94
someone    0.88
somebody   0.84
an         0.83
it         0.83
they       0.82
shit       0.82
a          0.82
Activations Density 0.060%