Explanations
requests or instructions in the text
Negative Logits
pires       -0.76
lings       -0.71
é¾          -0.70
visor       -0.69
IUM         -0.69
arc         -0.69
cler        -0.69
laus        -0.66
MpServer    -0.65
bent        -0.65
Positive Logits
Ignore       1.00
forgive      0.96
beware       0.96
note         0.95
excuse       0.95
enable       0.90
ignore       0.90
refrain      0.89
advise       0.89
disregard    0.88
Activations Density: 0.433%
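
The density figure can be read as the share of sampled token positions on which the feature fires. The sketch below illustrates that reading only; it assumes density means "fraction of positions with activation above a threshold", which the page itself does not state. The function name `activation_density`, the `threshold` parameter, and the sample data are all hypothetical.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of positions whose activation exceeds `threshold`.

    Assumption: "density" means the share of token positions where the
    feature is active. This definition is not taken from the source page.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: 1000 token positions, 4 of them with nonzero activation.
sample = [0.0] * 996 + [1.2, 0.3, 2.5, 0.7]
density = activation_density(sample)
print(f"{density:.3%}")  # prints 0.400%
```

Under this reading, a density of 0.433% would correspond to roughly 4 active positions per 1000 sampled tokens.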