Explanations
contractions combined with the negation "not"
negations or phrases indicating disagreement
Negative Logits
fixme         -0.68
ership        -0.66
IER           -0.65
Evaluation    -0.64
velt          -0.61
ilage         -0.61
inav          -0.61
ilege         -0.60
decency       -0.59
cano          -0.59
Positive Logits
alone         1.29
shy           1.17
amused        1.07
afraid        1.02
immune        1.01
necessarily   0.96
exactly       0.93
ashamed       0.92
Alone         0.92
kidding       0.91
Activations Density 0.103%