INDEX
Explanations
phrases indicating choice or contrast between two options
references to the concept of choosing between alternatives or pairs
Negative Logits
humans    -0.72
Psy       -0.68
ove       -0.67
Topics    -0.65
ships     -0.65
lux       -0.64
obin      -0.64
ories     -0.63
pir       -0.63
ARS       -0.62
Positive Logits
worldly   0.86
lobe      0.84
equally   0.80
dayName   0.74
glance    0.74
baseman   0.74
wart      0.72
ngth      0.69
ingred    0.68
heastern  0.66
Activations Density: 0.055%