INDEX
Explanations
phrases indicating capability or ability
Negative Logits
<bos>        -0.60
the          -0.54
your         -0.48
Schwartz     -0.48
in           -0.46
McCulloch    -0.45
these        -0.43
בח           -0.43
this         -0.43
The          -0.42
Positive Logits
Able             0.99
Able             0.96
able             0.93
unable           0.80
IsMutable        0.76
Unable           0.76
unable           0.75
tagHelperRunner  0.74
Unable           0.74
<unused43>       0.74
Activations Density: 0.011%