INDEX
Explanations
words related to a feeling of moral or factual incorrectness
statements indicating something is considered incorrect or morally wrong
Negative Logits
rien      -0.66
¯¯¯¯      -0.65
ribes     -0.63
cit       -0.63
anned     -0.62
apple     -0.62
glas      -0.62
usters    -0.61
ility     -0.61
é¾        -0.60
Positive Logits
wrong            1.02
unfocusedRange   0.89
wrong            0.86
culprit          0.80
fully            0.77
ibrary           0.75
mistaken         0.74
Wrong            0.74
tack             0.73
eous             0.70
Activations Density 0.014%
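
The density figure can be read as the share of token positions on which the feature fires at all. The sketch below illustrates that reading only. It assumes density means "fraction of positions with activation above a threshold", which this page does not state. The function name `activation_density`, the `threshold` parameter, and the sample data are all hypothetical.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of positions whose activation exceeds `threshold`.

    Assumption: "density" means the share of token positions where the
    feature is active. The dashboard itself does not define the term.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: 100,000 token positions, 14 of which activate.
sample = [0.0] * 99_986 + [1.5] * 14
density = activation_density(sample)
print(f"{density:.3%}")
```

Under this reading, a density of 0.014% corresponds to roughly 14 active positions per 100,000 tokens, which is a sparse feature.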