INDEX
Explanations
statements followed by a negation
negations or phrases indicating the absence of something
Negative Logits
former      -0.74
çļ          -0.69
Redditor    -0.69
umbn        -0.69
ourses      -0.69
send        -0.67
papers      -0.65
rift        -0.64
quet        -0.63
aviour      -0.63
Positive Logits
uncommon        1.37
icable          1.11
clear           1.09
unreasonable    1.09
surprising      1.08
necessarily     1.05
easy            1.03
advisable       0.99
feasible        0.96
unusual         0.95
Activations Density 0.079%
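
As a rough illustration of what a density figure like this could mean: assuming it reports the share of tokens on which the feature activates above zero (a common convention for such dashboards, but not stated on this page), it might be computed along the following lines. The function name, the threshold, and the sample data are all hypothetical.

```python
def activation_density(activations, threshold=0.0):
    """Return the percentage of tokens whose activation exceeds `threshold`.

    Assumption: "density" means the fraction of tokens with a nonzero
    activation; the page itself does not define the term.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return 100.0 * active / len(activations)


# Hypothetical sample: 10,000 tokens, 8 of which activate the feature.
sample = [0.0] * 9992 + [0.5, 1.2, 0.3, 2.1, 0.9, 0.4, 1.7, 0.6]
density = activation_density(sample)
print(f"{density:.3f}%")
```

Under that reading, a density of 0.079% would correspond to roughly 79 active tokens per 100,000.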