Explanations
words related to preferences
terms related to preferences and choices
Negative Logits
Token        Logit
manship      -0.76
mberg        -0.74
wordpress    -0.67
tics         -0.67
angel        -0.66
STD          -0.66
bane         -0.66
meal         -0.65
Bam          -0.65
icago        -0.65
Positive Logits
Token          Logit
preferences    1.06
favoring       0.96
preference     0.86
favoured       0.85
eering         0.82
favored        0.77
pane           0.77
ļéĨĴ           0.76
favors         0.75
ately          0.75
Activations Density: 0.032%