Explanations
words related to preference or support
instances of the word "favor" and its variations, indicating preference or support
Negative Logits
Token    Logit
sis      -0.88
ı        -0.83
pt       -0.80
thur     -0.79
mberg    -0.77
RT       -0.76
hid      -0.76
att      -0.75
gren     -0.75
raz      -0.72
Positive Logits
Token        Logit
favored      1.16
itism        1.07
favoring     0.99
favors       0.93
favoured     0.93
nesday       0.86
favorites    0.86
whipping     0.80
hitters      0.74
favorable    0.73
Activations Density: 0.008%