Explanations
words related to personal preferences or choices
references to personal preferences
Negative Logits
Token        Logit
bane         -0.81
mans         -0.76
kj           -0.74
mberg        -0.73
wordpress    -0.72
amaz         -0.71
Sus          -0.69
bold         -0.69
WARN         -0.69
Adams        -0.69
Positive Logits
Token          Logit
preferences    1.07
yip            0.96
preference     0.92
favoring       0.91
elig           0.84
favoured       0.80
selection      0.77
palate         0.76
skew           0.74
choice         0.74
Activations Density: 0.012%