Explanations
phrases expressing negation or exclusion
negations indicating opposition or resistance to various ideas and actions
Negative Logits
Token          Logit
eatures        -0.71
ilar           -0.71
atural         -0.68
Annotations    -0.65
verified       -0.64
legit          -0.63
INAL           -0.62
lycer          -0.62
mentioned      -0.62
ensis          -0.62
Positive Logits
Token          Logit
compl          1.02
succumb        0.96
shortcuts      0.91
tolerate       0.91
scapego        0.91
hesitate       0.90
compromises    0.90
compromise     0.88
excuses        0.88
shy            0.87
Activations Density 0.318%
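
The density figure above is usually read as the share of token positions on which the feature is active. The page does not state its exact definition, so the following is only a minimal sketch under that assumption. The function name `activation_density`, the `threshold` parameter, and the toy data are hypothetical and are not taken from the dashboard.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds threshold.

    Assumption: "density" means the share of positions where the feature fires,
    i.e. where its activation is strictly greater than `threshold`.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Toy data (hypothetical): 1000 token positions, 3 of them active.
toy_activations = [0.0] * 997 + [1.2, 0.4, 2.5]
density = activation_density(toy_activations)
print(f"Activation density: {density:.3%}")
```

Under this reading, a density of 0.318% would mean the feature fires on roughly 3 out of every 1000 token positions, which fits a sparse feature tied to a narrow set of phrases.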