INDEX
Explanations
adverbs ending in -ingly
phrases and structures related to deception and pretense
Negative Logits
ourses       -0.85
otiation     -0.81
ests         -0.78
cox          -0.78
rer          -0.75
atl          -0.74
mentioned    -0.73
cies         -0.72
olphin       -0.72
summary      -0.71
Positive Logits
invincible    0.85
unbeat        0.84
kindred       0.81
unstoppable   0.79
innocuous     0.76
benign        0.75
harmless      0.74
resemblance   0.72
spurious      0.71
immune        0.70
Activations Density 0.521%