INDEX
Explanations
comparisons indicating improvement or preference
comparative terms indicating preference or judgment
Negative Logits
utm          -0.75
eur          -0.70
ulating      -0.65
ategory      -0.62
uly          -0.62
Greenpeace   -0.61
urer         -0.61
inous        -0.61
itory        -0.61
naires       -0.60
Positive Logits
yet          0.99
than         0.93
Than         0.88
than         0.84
ment         0.81
still        0.79
behaved      0.75
luck         0.73
safe         0.71
bye          0.71
Activations Density: 0.033%