INDEX
Explanations
phrases related to choices or options
the presence of the end-of-text token
Negative Logits
Vaugh     -0.68
Jagu      -0.67
evidence  -0.62
iments    -0.57
Adin      -0.57
ATURES    -0.55
anism     -0.55
Edit      -0.54
achu      -0.54
agree     -0.54
Positive Logits
lot       0.96
bunch     0.89
couple    0.82
handful   0.81
plethora  0.77
huge      0.76
uras      0.76
few       0.75
glimpse   0.75
whopping  0.74
Activations Density: 0.609%