Explanations
commands or phrases that prompt action or engagement
Negative Logits
Token    Logit
ove      -0.16
oola     -0.16
rawer    -0.15
ufe      -0.15
pu       -0.15
agues    -0.14
idis     -0.14
ocale    -0.14
ruh      -0.14
ADX      -0.14
Positive Logits
Token      Logit
down       0.24
amongst    0.23
cracking   0.20
busy       0.20
active     0.19
-to        0.19
thee       0.18
down       0.18
suited     0.18
together   0.18
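The two lists above are the tokens at each end of a ranking by score. A minimal sketch of that ranking step, assuming (as is common for SAE dashboards, but not stated here) that each score is the feature's direct effect on a token's logit; the function name `top_and_bottom_logits` and the sample scores are hypothetical, and how the scores themselves are computed is not shown:

```python
from typing import Dict, List, Tuple


def top_and_bottom_logits(
    scores: Dict[str, float], k: int = 10
) -> Tuple[List[Tuple[str, float]], List[Tuple[str, float]]]:
    """Return the k most positive and k most negative (token, score) pairs.

    `scores` is assumed to map each vocabulary token to the feature's
    (hypothetical) effect on that token's logit.
    """
    # Sort ascending so the most negative scores come first.
    ranked = sorted(scores.items(), key=lambda item: item[1])
    negative = ranked[:k]        # most negative first
    positive = ranked[::-1][:k]  # most positive first
    return positive, negative


if __name__ == "__main__":
    sample = {"down": 0.24, "amongst": 0.23, "ove": -0.16, "oola": -0.15, "the": 0.0}
    pos, neg = top_and_bottom_logits(sample, k=2)
    print(pos)  # [('down', 0.24), ('amongst', 0.23)]
    print(neg)  # [('ove', -0.16), ('oola', -0.15)]
```

A real vocabulary can contain tokens that print identically (the two `down` entries above likely differ only in a leading space lost in extraction), so a production version would key on token IDs rather than strings.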
Activation Density: 0.054%
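Activation density is usually read as the percentage of sampled tokens on which the feature fires. A minimal sketch, assuming "fires" means an activation strictly above a threshold of 0 (the dashboard does not state its threshold or sample size); `activation_density` is a hypothetical helper:

```python
from typing import Iterable


def activation_density(activations: Iterable[float], threshold: float = 0.0) -> float:
    """Percentage of tokens whose activation exceeds `threshold`."""
    total = 0
    active = 0
    for a in activations:
        total += 1
        if a > threshold:
            active += 1
    if total == 0:
        raise ValueError("no activations supplied")
    return 100.0 * active / total


if __name__ == "__main__":
    # 54 active tokens out of 100,000 sampled tokens.
    acts = [1.0] * 54 + [0.0] * (100_000 - 54)
    print(activation_density(acts))  # 0.054
```

Under these assumptions, a density of 0.054% corresponds to the feature firing on roughly 54 of every 100,000 tokens, i.e. a very sparse feature.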