Explanations
personal pronouns or possessive pronouns associated with a sense of self
references to feelings and personal experiences
Negative Logits
Us            -0.68
themselves    -0.66
Their         -0.63
Their         -0.61
Plaint        -0.59
arsen         -0.58
idates        -0.57
us            -0.57
tariffs       -0.56
Diff          -0.55
Positive Logits
myself         1.45
blogging       0.93
my             0.82
typing         0.72
OCD            0.71
watching       0.69
researching    0.68
writing        0.68
personally     0.66
aido           0.66
Activations Density: 1.006%
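
The density figure can be read as the share of token positions on which the feature is active. The page does not define the metric, so the sketch below is an assumption: it treats a position as active when its activation exceeds a threshold of zero. The function name `activation_density` and the toy data are hypothetical and only illustrate the arithmetic.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds threshold.

    Assumption: "active" means strictly greater than the threshold. The source
    page does not state the exact definition it uses.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical toy data: 1000 token positions, 10 of which are active.
toy = [0.0] * 990 + [0.5] * 10
density = activation_density(toy)
print(f"{density:.3%}")  # prints 1.000%
```

Under this reading, a density of about 1% means the feature fires on roughly one token position in a hundred.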