INDEX
Explanations
punctuation marks at the end of sentences
Negative Logits
deceptive          -0.61
confidentiality    -0.57
whistleblowers     -0.57
privacy            -0.57
outgoing           -0.56
safety             -0.56
advis              -0.55
confidential       -0.54
trusted            -0.54
stewards           -0.53
Positive Logits
imgur              1.01
e                  0.91
aca                0.75
hat                0.75
seed               0.74
MX                 0.71
hs                 0.71
minimum            0.71
¼                  0.70
medium             0.70
Activations Density: 0.021%