INDEX
Explanations
phrases indicating directness or explicitness
Negative Logits
eport               -0.82
aido                -0.76
pload               -0.73
emis                -0.72
isure               -0.70
Pastebin            -0.69
kees                -0.68
=-=-=-=-=-=-=-=-    -0.68
lain                -0.68
nan                 -0.65
Positive Logits
rejection           0.77
refusal             0.70
contradicted        0.70
obliter             0.69
contradicts         0.69
contradict          0.68
disregard           0.68
ERROR               0.68
lie                 0.66
butt                0.65
Activations Density 0.091%