INDEX
Explanations
pronouns that refer to subjects or objects in sentences
Negative Logits
Bliss     -0.14
rics      -0.14
Strict    -0.14
loys      -0.14
apan      -0.14
éľŀ       -0.14
eland     -0.14
nock      -0.13
ît        -0.13
gd        -0.13
Positive Logits
am            0.23
ching         0.19
used          0.19
all           0.18
achi          0.18
AMI           0.18
boiling       0.17
ultimately    0.17
Takes         0.17
took          0.17
Activations Density: 0.137%