INDEX
Explanations
references to concepts of "self" or identity
self and reflexive pronouns
Negative Logits
Infórmanos    -0.80
Importing     -0.60
importing     -0.59
loopholes     -0.53
Dian          -0.53
נוס           -0.53
crates        -0.52
the           -0.52
cozin         -0.52
arenas        -0.52
Positive Logits
Self          1.39
Self          1.30
SELF          1.27
self          1.27
self          1.26
SELF          1.20
selves        1.03
thyself       0.87
Yourself      0.84
himself       0.84
Activations Density 0.020%