INDEX
Explanations
phrases related to certainty or confirmation
strong expressions of denial or disagreement
Negative Logits
icipated      -0.78
Siber         -0.74
bathrooms     -0.68
Tec           -0.67
restrooms     -0.66
Jensen        -0.64
downstream    -0.61
populated     -0.61
transformer   -0.60
exploits      -0.60
Positive Logits
Yeah          1.11
Yes           1.01
Hmm           0.99
Answer        0.99
Exactly       0.98
YES           0.96
Correct       0.94
sir           0.93
Absolutely    0.92
Exactly       0.91
Activations Density 0.533%