Explanations
instances of attributed speech or reporting phrases
NEGATIVE LOGITS
Token     Logit
spin      -0.18
spit      -0.15
Spin      -0.15
Gil       -0.14
sch       -0.14
pr        -0.14
Mou       -0.14
esign     -0.14
ancial    -0.13
mou       -0.13
POSITIVE LOGITS
Token     Logit
ách       0.16
tems      0.16
ymi       0.15
/values   0.14
zug       0.14
ato       0.14
dcc       0.14
_abort    0.14
writes    0.14
atars     0.13
Activations Density 0.037%