INDEX
Explanations
mentions of "our" or possessive forms in a text
expressions of gratitude and appreciation
Negative Logits
Token         Logit
puff          -0.78
tar           -0.77
bender        -0.77
icter         -0.75
atican        -0.70
conom         -0.68
REUTERS       -0.68
more          -0.68
netflix       -0.67
contradicts   -0.66
Positive Logits
Token         Logit
selves        1.47
own           1.22
respective    0.98
collective    0.97
selves        0.94
asses         0.93
adversaries   0.90
dear          0.90
motto         0.86
beloved       0.84
Activations Density: 0.123%