Explanations
words related to updating, changing, and influencing
forms of "to be"
Negative Logits
is          -0.91
does        -0.77
has         -0.77
knows       -0.72
goes        -0.71
gets        -0.69
takes       -0.66
realizes    -0.66
becomes     -0.66
begins      -0.65
Positive Logits
were        1.40
are         1.27
weren       1.16
WERE        1.13
were        1.12
ARE         1.02
aren        0.97
Were        0.96
Were        0.91
are         0.91
Activations Density: 4.196%