Explanations
phrases emphasizing the importance of various actions or considerations
phrases emphasizing the significance of certain statements or actions
Negative Logits
ILA         -0.74
uthor       -0.73
ウス        -0.66
Carbuncle   -0.63
Ended       -0.62
opoly       -0.61
urus        -0.59
favorite    -0.57
bare        -0.56
ゴン        -0.56
Positive Logits
that          0.96
to            0.89
enough        0.80
for           0.80
nonetheless   0.76
we            0.74
lest          0.69
that          0.66
expr          0.65
not           0.64
Activations Density 0.070%