Explanations
affirmative responses indicating agreement or acknowledgment
Negative Logits
mis      -0.59
rif      -0.58
estor    -0.58
://      -0.57
larak    -0.57
ny       -0.57
:\/\/    -0.56
Mü       -0.56
sive     -0.55
Berk     -0.55
Positive Logits
YEAH     1.90
Yeah     1.87
Yeah     1.84
yeah     1.83
yeah     1.71
YEAH     1.67
Yep      1.40
Yep      1.33
Yea      1.32
yep      1.28
Activations Density 0.064%