INDEX
Explanations
references to the concept of "self" or self-control
references to the concept of "self" or self-related terms
Negative Logits
cill     -0.73
aldi     -0.72
cot      -0.69
lu       -0.68
cli      -0.67
cape     -0.64
lam      -0.64
mole     -0.63
gypt     -0.62
opers    -0.61
Positive Logits
Self           3.35
self           1.63
self           1.56
Self           1.53
selves         1.20
Mutual         1.00
selves         0.98
Personality    0.98
Narc           0.96
Subtle         0.93
Activations Density: 0.005%