Explanations
questions that express confusion or challenge the status quo
Negative Logits
successive      -0.84
selective       -0.71
sustained       -0.67
environmental   -0.66
etheless        -0.66
continued       -0.64
gradual         -0.63
delays          -0.62
surplus         -0.62
fewer           -0.62
Positive Logits
fuck                 0.89
soType               0.88
abouts               0.79
ãĤ§                  0.78
isSpecialOrderable   0.78
fork                 0.75
;)                   0.74
Fuck                 0.73
Ñı                   0.73
bang                 0.72
Activations Density 0.195%