Explanations
pronouns and verbs related to self-identity
references to identity and self-perception
Negative Logits
Token        Logit
artifacts    -0.62
diffusion    -0.61
Grounds      -0.60
liner        -0.60
hill         -0.59
Passage      -0.57
horizont     -0.57
UCT          -0.56
sheds        -0.55
adv          -0.55
Positive Logits
Token        Logit
borgh        0.76
become       0.72
am           0.71
uably        0.70
Thumbnail    0.68
Become       0.66
truly        0.66
ãĤ«          0.63
pretended    0.62
aspire       0.62
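The positive-logit token `ãĤ«` looks like the display form that byte-level BPE tokenizers use for raw bytes, not a corrupted string. The sketch below assumes a GPT-2-style byte-to-character table (the source does not name the tokenizer, so this is an assumption) and maps each displayed character back to its byte. The helper names `bytes_to_unicode` and `decode_display_token` are illustrative and not taken from the dashboard.

```python
def bytes_to_unicode():
    """Build the GPT-2-style byte -> printable character table (assumed tokenizer)."""
    # Bytes that are already printable keep their own code point.
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    # Remaining bytes are shifted to code points 256, 257, ... in byte order.
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))


# Inverse table: displayed character -> original byte value.
UNICODE_TO_BYTE = {ch: b for b, ch in bytes_to_unicode().items()}


def decode_display_token(token: str) -> str:
    """Turn a displayed vocabulary entry back into the text it encodes."""
    raw = bytes(UNICODE_TO_BYTE[ch] for ch in token)
    # Partial multi-byte sequences are possible for some tokens, hence 'replace'.
    return raw.decode("utf-8", errors="replace")


print(decode_display_token("ãĤ«"))  # カ
```

Under that assumption the token is the three UTF-8 bytes E3 82 AB, which is the katakana character カ. Plain ASCII entries such as `am` decode to themselves.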
Activations Density 0.117%
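A density figure like this is usually read as the fraction of sampled tokens on which the feature is active. The sketch below assumes that definition (nonzero activations divided by total tokens). The function name `activation_density` and the sample data are hypothetical and are not the values behind the 0.117% above.

```python
def activation_density(activations, threshold=0.0):
    """Fraction of tokens whose activation exceeds `threshold` (assumed definition)."""
    if not activations:
        raise ValueError("need at least one activation value")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: the feature fires on 1 token out of 1000.
sample = [0.0] * 999 + [2.5]
print(f"{activation_density(sample):.3%}")  # 0.100%
```

At 0.117%, the feature would fire on roughly one token in 850 under this reading, which is consistent with a sparse feature.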