Explanations
possessive pronouns related to the user
Negative Logits
Token         Logit
himself       -0.15
themselves    -0.15
lights        -0.15
ald           -0.15
positories    -0.14
istra         -0.14
erty          -0.14
ói            -0.14
447           -0.14
ino           -0.14
Positive Logits
Token         Logit
yourself      0.23
anmar         0.21
nger          0.20
essler        0.19
guys          0.19
opia          0.17
ths           0.17
Yourself      0.17
’re           0.16
zon           0.15
Activations Density: 0.202%