Explanations
pronouns and the actions associated with them
Negative Logits
Token      Logit
lys        -0.17
thereby    -0.16
arse       -0.16
orf        -0.15
ucu        -0.14
indy       -0.14
stag       -0.14
dal        -0.14
ibt        -0.14
beg        -0.13
Positive Logits
Token      Logit
ê¶ģ        0.15
alone      0.15
enthal     0.15
imits      0.14
endoza     0.14
itted      0.14
_tE        0.14
IFn        0.14
_tF        0.14
cade       0.14
Activations Density: 0.278%