INDEX
Explanations
mentions of winning or achieving success in competitions or games
phrases that indicate completeness or totality
Negative Logits
grad        -0.68
aminer      -0.56
EVERY       -0.56
stone       -0.56
robe        -0.56
Malf        -0.55
lav         -0.55
sometimes   -0.54
lad         -0.52
plin        -0.51
Positive Logits
ocating     1.04
usions      0.97
uding       0.94
three       0.92
igator      0.92
udes        0.89
ocation     0.88
four        0.88
ogene       0.88
ocations    0.84
Activations Density 0.106%