INDEX
Explanations
mentions of actions or characteristics related to someone
references to the concept of "someone."
Negative Logits
ories       -0.86
osterone    -0.81
èª          -0.70
eed         -0.68
ory         -0.68
inders      -0.67
heny        -0.67
DOS         -0.66
AIN         -0.66
irth        -0.65
Positive Logits
else             1.73
Else             1.26
Else             1.06
else             1.04
WithNo           0.88
who              0.81
smugg            0.72
knowledgeable    0.72
unlucky          0.69
identifiable     0.68
Activations Density: 0.039%