INDEX
Explanations
requests or instructions prompting action or engagement
phrases that encourage actions or suggest following instructions
Negative Logits
MpServer          -0.82
oub               -0.77
TPPStreamerBot    -0.74
iche              -0.73
bled              -0.71
onut              -0.70
ELD               -0.69
gery              -0.66
EStreamFrame      -0.66
lot               -0.65
Positive Logits
checking          0.69
clicking          0.68
your              0.68
patience          0.67
yourselves        0.65
beforehand        0.64
Patrol            0.63
rity              0.63
you               0.62
Shogun            0.62
Activations Density 0.043%