INDEX
Explanations
references to personal identity and relationships
Negative Logits
itself      -0.18
Indented    -0.15
bor         -0.15
ema         -0.14
urma        -0.14
/effects    -0.14
himself     -0.14
Ïİν         -0.14
arlo        -0.14
rale        -0.14
Positive Logits
differently  0.17
face         0.17
again        0.16
perform      0.16
doing        0.15
coming       0.15
DJ           0.15
/her         0.14
pard         0.14
smile        0.14
Activations Density 0.049%