INDEX
Explanations
words related to communication with emphasis or affirmation
conversational phrases that express humor or sarcasm
Negative Logits
Token           Logit
pires           -0.73
minster         -0.66
Interestingly   -0.63
³               -0.62
abase           -0.61
Hack            -0.60
Result          -0.60
aneous          -0.60
lier            -0.59
Page            -0.59
Positive Logits
Token           Logit
please          1.02
_-              1.00
lest            1.00
unless          0.99
because         0.89
unless          0.87
they            0.84
it              0.83
we              0.83
otherwise       0.82
Activations Density: 0.267%