Explanations
phrases signifying comparisons or contrasts
Negative Logits
-other         -0.23
attorney       -0.23
ambulance      -0.22
ambush         -0.22
asshole        -0.22
actress        -0.22
attack         -0.21
appointment    -0.21
apartment      -0.21
assistant      -0.21
Positive Logits
few            0.23
(n             0.23
.k             0.21
couple         0.21
variety        0.20
[n             0.20
handful        0.20
particular     0.20
irt            0.20
irm            0.19
Activations Density 1.473%