INDEX
Explanations
phrases indicating progress or improvement
phrases indicating progress or distance traveled toward a goal
Negative Logits
Token       Logit
urated      -0.80
iries       -0.70
ividual     -0.63
uala        -0.63
ulhu        -0.63
icist       -0.60
rones       -0.59
zinski      -0.58
pairs       -0.58
iasco       -0.58
Positive Logits
Token       Logit
toward      0.91
towards     0.87
WARD        0.74
fare        0.69
Towards     0.68
lier        0.66
Sabha       0.64
Drawn       0.63
finder      0.63
separating  0.62
Activations Density: 0.033%