Explanations
mentions of verbs that describe some change in state or position
Negative Logits
wrote       -1.52
took        -1.46
withdrew    -1.45
grew        -1.42
froze       -1.41
flew        -1.41
wore        -1.41
threw       -1.39
knew        -1.39
undertook   -1.37
Positive Logits
taken       1.10
given       1.02
shown       0.89
done        0.86
seen        0.86
a           0.82
to          0.74
come        0.73
in          0.73
up          0.72
Activations Density: 4.355%