Explanations
direct responses of negation or refusal
negative responses or denials
Negative Logits
Token       Logit
ortment     -0.72
reverted    -0.65
kefeller    -0.63
drawn       -0.60
rored       -0.60
sing        -0.58
pez         -0.58
ylan        -0.57
riers       -0.57
abal        -0.57
Positive Logits
Token       Logit
sir         0.96
Nope        0.91
terday      0.89
!           0.88
!,          0.83
!.          0.83
Absolutely  0.79
worries     0.78
!!!!        0.77
.           0.77
Activations Density 0.072%