INDEX
Explanations
mentions of winning or achievements
instances of the word "win."
Negative Logits
erity       -0.75
umn         -0.64
ikk         -0.64
protr       -0.62
includ      -0.62
Uz          -0.61
footprint   -0.60
Else        -0.60
ciplinary   -0.60
assembled   -0.59
Positive Logits
ners        0.91
nings       0.87
now         0.76
win         0.76
throp       0.75
ception     0.74
iors        0.74
iem         0.74
ces         0.72
't          0.72
Activations Density 0.026%