INDEX
Explanations
words related to the concept of 'self'
instances of the word "self" in different contexts
Negative Logits
Flags       -0.81
Rabbit      -0.72
Powers      -0.72
Decay       -0.68
Pose        -0.68
Shot        -0.66
Crus        -0.64
Paradise    -0.63
Canary      -0.63
Barrier     -0.62
Positive Logits
actory      1.09
onso        1.01
rint        0.99
lf          0.97
ibrary      0.96
enn         0.92
bour        0.91
poons       0.89
andom       0.89
ood         0.89
Activations Density 0.007%