INDEX
Explanations
phrases indicating instructions or requests in a polite manner
requests or prompts for action
Negative Logits
Token       Logit
arthed      -0.75
pires       -0.73
Tok         -0.68
76561       -0.68
Aust        -0.67
senal       -0.67
MpServer    -0.67
arc         -0.67
é¾          -0.67
byss        -0.66
Positive Logits
Token       Logit
note        1.16
refrain     1.09
forgive     1.08
Ignore      1.07
consider    1.07
enable      1.07
excuse      1.01
beware      1.00
refer       0.99
disregard   0.99
Activations Density: 0.031%