Explanations
quotation marks and their associated wording
Negative Logits
mpar          -0.15
ÐIJÑĢÑħÑĸв    -0.15
EMPL          -0.14
empl          -0.14
remar         -0.13
âĢŀ           -0.12
“             -0.12
itial         -0.12
ãģŁãģı        -0.12
ANNOT         -0.12
Positive Logits
Oh            0.27
oh            0.27
yeah          0.27
Yeah          0.26
ouch          0.25
Hey           0.24
hey           0.24
I             0.24
ugh           0.24
oops          0.24
Activations Density 0.125%