INDEX
Explanations
instances of yes or no responses
affirmative responses or expressions of agreement
Negative Logits
kefeller    -0.70
aign        -0.60
uese        -0.59
artney      -0.57
illin       -0.56
drawn       -0.55
agra        -0.55
riers       -0.55
Citiz       -0.54
gins        -0.53
Positive Logits
sir          1.11
!            1.04
.            0.98
Absolutely   0.95
Absolutely   0.93
!.           0.91
!!!          0.86
!!!!         0.86
yes          0.85
!!           0.85
Activations Density 0.169%