INDEX
Explanations
requests or instructions written with a polite tone
occurrences of the word "please" in various formats
Negative Logits
lings       -0.76
pires       -0.74
é¾          -0.74
arthed      -0.71
arc         -0.70
cler        -0.69
IUM         -0.65
MpServer    -0.65
visor       -0.64
ARC         -0.62
Positive Logits
Ignore      1.07
beware      1.04
forgive     1.03
note        0.98
ignore      0.95
advise      0.93
excuse      0.92
refrain     0.92
enable      0.89
consider    0.88
Activations Density 0.033%
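
The density figure above is not defined on this page. A minimal sketch, assuming it means the fraction of token positions on which the feature's activation is non-zero, could look like the following. The function name `activation_density`, the `threshold` parameter, and the sample data are hypothetical and are not taken from the dashboard.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of positions whose activation exceeds `threshold`.

    Assumption: "density" means (number of active positions) / (total positions).
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical data: 3 active positions out of 10,000 tokens.
sample = [0.0] * 9997 + [1.2, 0.4, 3.1]
density = activation_density(sample)
print(f"{density:.3%}")
```

Under that assumption, a density of 0.033% would correspond to the feature firing on roughly 1 in 3,000 token positions.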