Explanations
the word "Me" with varying degrees of emphasis indicated by activation strength
the repetition of the word "Me"
Negative Logits
UAL          -0.72
ctl          -0.66
icably       -0.65
flush        -0.65
acing        -0.64
flush        -0.63
ulative      -0.62
ript         -0.62
itiveness    -0.62
OWER         -0.60
Positive Logits
Me           3.48
Me           2.45
ME           2.04
me           1.77
Us           1.49
Meh          1.41
ME           1.36
Him          1.33
My           1.27
me           1.24
Activations Density 0.012%