Explanations
instances of personal decisions or opinions in text
Negative Logits
themselves      -0.68
respectively    -0.64
Their           -0.60
their           -0.59
EMS             -0.55
Autob           -0.55
alike           -0.54
allegedly       -0.54
Their           -0.54
their           -0.53
Positive Logits
myself          1.54
my              0.91
poke            0.72
ograp           0.64
personally      0.61
fuckin          0.61
chair           0.58
MY              0.58
milo            0.57
laughs          0.56
Activations Density 0.896%