INDEX
Explanations
phrases indicating surprise or disbelief
the phrase "don't even" and variations emphasizing denial or lack of awareness
Negative Logits
rend              -0.81
ugal              -0.75
cipl              -0.70
ubi               -0.67
runtime           -0.66
ãĤº               -0.66
=-=-=-=-=-=-=-=-  -0.63
acha              -0.63
edient            -0.63
only              -0.62
Positive Logits
remotely          1.33
bothering         1.03
bothered          0.99
bother            0.98
close             0.84
mentioning        0.83
scratch           0.81
mention           0.81
halfway           0.80
pretend           0.79
Activations Density 0.057%