Explanations
possessive pronouns and references to ownership or authorship
Negative Logits
Token         Logit
fuse          -0.16
istrovství    -0.15
tato          -0.14
alse          -0.14
ussy          -0.14
bee           -0.14
英文          -0.14
stroy         -0.14
Bris          -0.14
Rivera        -0.13
Positive Logits
Token         Logit
approach      0.17
rens          0.16
proposal      0.16
utan          0.15
eph           0.15
findings      0.15
earer         0.15
POSITE        0.14
gre           0.14
colleague     0.14
Activations Density 0.138%
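
The density figure above can be read as the share of token positions on which the feature fires. The sketch below illustrates that reading only; it assumes "active" means an activation strictly greater than zero, and the function name `activation_density`, the `threshold` parameter, and the sample data are all hypothetical, not taken from the dashboard.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of positions whose activation exceeds `threshold`.

    Assumption: a position counts as active when its activation is
    strictly greater than `threshold` (0.0 by default).
    """
    if not activations:
        raise ValueError("need at least one activation value")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical data: 138 active positions out of 100,000 tokens.
sample = [1.5] * 138 + [0.0] * (100_000 - 138)
print(f"{activation_density(sample):.3%}")
```

With this made-up sample, 138 of 100,000 positions are active, which matches the order of magnitude of the density reported here: the feature is silent on the vast majority of tokens.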