INDEX
Explanations
instances where a person is speaking or expressing themselves
instances of the pronoun "I"
Negative Logits
Token      Logit
PTS        -0.74
itol       -0.68
Ele        -0.62
imum       -0.62
Virtue     -0.61
ision      -0.61
lihood     -0.61
enges      -0.60
Concord    -0.59
airs       -0.59
Positive Logits
Token      Logit
've        1.22
'm         1.19
'll        1.18
dunno      1.07
forgot     1.06
'd         1.02
swear      0.93
RL         0.89
suppose    0.88
cheated    0.88
Activations Density: 0.234%