INDEX
Explanations
phrases related to progression and initial actions
phrases indicating sequential actions or processes
Negative Logits
Token      Logit
ancies     -0.79
olls       -0.70
bugs       -0.66
comments   -0.65
resistant  -0.64
ustomed    -0.63
rums       -0.63
sleep      -0.61
noxious    -0.59
sung       -0.58
Positive Logits
Token      Logit
toward     1.07
towards    1.06
step       0.93
Steps      0.84
steps      0.84
nings      0.77
Towards    0.75
phase      0.74
hurdle     0.73
Step       0.72
Activations Density: 0.091%