Explanations
phrases that indicate a call to action or directives
Negative Logits
Token    Logit
undles   -0.16
ypes     -0.15
utor     -0.15
rej      -0.14
dash     -0.14
abela    -0.14
eka      -0.14
okit     -0.14
ambi     -0.14
ily      -0.14
Positive Logits
Token      Logit
attention  0.26
quits      0.23
duty       0.23
ibrate     0.23
forth      0.22
oused      0.22
dib        0.21
Attention  0.21
action     0.20
Duty       0.20
Activations Density: 0.048%