INDEX
Explanations
phrases indicating negation or potential misunderstanding
phrases that express caution or reassurance
Negative Logits
Token        Logit
ilogy        -0.76
anded        -0.68
alist        -0.66
ially        -0.64
figured      -0.64
azo          -0.62
ranch        -0.62
ettlement    -0.62
atar         -0.62
erial        -0.61
Positive Logits
Token        Logit
yourself     1.05
yourselves   1.02
anymore      0.93
Yourself     0.85
ANY          0.80
any          0.79
your         0.75
whining      0.70
anything     0.70
YOUR         0.68
Activations Density: 0.123%