Explanations
expressions related to societal expectations and personal identity
Negative Logits
ibs          -0.18
mas          -0.18
rl           -0.16
maf          -0.15
ainer        -0.14
ible         -0.14
uan          -0.14
408          -0.14
ronics       -0.14
aporation    -0.14
Positive Logits
everybody     0.21
everyone      0.20
everyone      0.20
Everyone      0.18
specific      0.17
individual    0.17
Everybody     0.17
æľĢæĸ°        0.16
ione          0.16
individ       0.16
Activations Density: 0.009%