INDEX
Explanations
phrases related to self-identification or attribution of identity
phrases indicating self-identification, especially in relation to gender and identity
Negative Logits
Token       Logit
ersen       -0.85
ysc         -0.84
terness     -0.71
erest       -0.70
TPS         -0.69
Plex        -0.69
TL          -0.68
ttp         -0.67
ipl         -0.67
Side        -0.66
Positive Logits
Token       Logit
belonging   0.89
pires       0.83
pired       0.77
follows     0.69
pers        0.68
Die         0.68
©¶æ         0.68
Commando    0.67
Burk        0.65
Nig         0.64
Activations Density 0.065%
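
To make the density figure concrete, here is a minimal sketch of how such a value could be computed. It assumes that activation density means the fraction of token positions on which the feature fires (activation above zero); the page itself does not define the term. The function name `activation_density`, the `threshold` parameter, and the sample data are hypothetical and for illustration only.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds `threshold`.

    Assumption: "density" counts a token as active when its activation is
    strictly greater than the threshold (zero by default).
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: 100,000 token positions, 65 of which activate the feature.
sample = [0.0] * 99_935 + [1.2] * 65
density = activation_density(sample)
print(f"{density:.3%}")  # prints 0.065%
```

Under this reading, a density of 0.065% corresponds to the feature firing on roughly 65 out of every 100,000 tokens.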