INDEX
Explanations
the word "preferred" in a sentence
references to choices or preferences
Negative Logits
????????      -0.75
Tours         -0.67
circle        -0.67
humans        -0.67
ences         -0.67
lifting       -0.66
iry           -0.65
Revival       -0.63
Unlock        -0.62
corruption    -0.61
Positive Logits
preferred     3.93
favoured      2.15
desired       1.95
Preferred     1.95
favored       1.94
preference    1.93
preferable    1.90
prefers       1.77
prefer        1.74
disliked      1.64
Activations Density 0.008%