Explanations
instructions or recommendations to perform specific actions
phrases that emphasize reminders or suggestions
Negative Logits
gery          -0.76
alist         -0.74
impl          -0.72
rock          -0.69
pher          -0.68
cience        -0.66
folk          -0.66
alt           -0.64
heres         -0.62
bern          -0.62
Positive Logits
Availability  0.71
icio          0.71
beforehand    0.69
quished       0.66
Siren         0.65
thous         0.63
reau          0.62
caveats       0.62
!:            0.62
yip           0.61
Activations Density 0.068%