INDEX
Explanations
personal statements or opinions starting with "I"
sentences that express personal identity or self-reference
Negative Logits
tnc                  -0.68
tains                -0.66
indistinguishable    -0.57
Rockefeller          -0.57
Gap                  -0.55
Reverse              -0.54
groupon              -0.54
excess               -0.54
pires                -0.53
Philipp              -0.53

Positive Logits
'm                    1.45
've                   1.31
dunno                 1.22
'll                   1.22
suppose               1.15
'd                    1.06
nex                   1.02
guess                 1.01
WI                    1.00
mean                  0.97
Activations Density: 0.233%
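
As a rough illustration of what a density figure like this measures, the sketch below computes the fraction of token positions on which a feature fires. It assumes density means "share of tokens with activation above zero"; the function name `activation_density`, the `threshold` parameter, and the sample data are hypothetical and not taken from the dashboard.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds threshold.

    Assumption: a token counts as "active" when its activation is strictly
    greater than threshold (0.0 by default).
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical data: 233 active tokens out of 100,000, chosen only so the
# result matches the order of magnitude of the figure reported above.
sample = [0.0] * 99_767 + [1.5] * 233
density = activation_density(sample)
print(f"{density:.3%}")
```

A density this low would mean the feature is silent on the vast majority of tokens, which fits a feature described as firing on first-person statements.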