INDEX
Explanations
phrases indicating an acknowledgment of errors or mistakes
phrases related to guilt and truthfulness
Negative Logits
accompany       -0.75
lde             -0.72
accompanying    -0.72
Flavoring       -0.72
elve            -0.70
greets          -0.70
Cooldown        -0.69
ingle           -0.69
»Ĵ              -0.67
phabet          -0.66
Positive Logits
wrong           1.90
Wrong           1.68
wrong           1.64
incorrect       1.55
mistake         1.42
faulty          1.39
wrongly         1.37
mistaken        1.33
Wr              1.32
misleading      1.32
Activations Density: 0.667%