INDEX
Explanations
phrases related to discrimination and negative attitudes towards particular groups
references to homophobia and related social issues
Negative Logits
Chocobo     -0.72
tnc         -0.71
ibaba       -0.70
atche       -0.70
istors      -0.70
olicited    -0.68
ujah        -0.67
uce         -0.65
agra        -0.64
zinski      -0.64
Positive Logits
§           0.78
urst        0.68
stadt       0.66
ĭ           0.65
esse        0.64
resy        0.64
esy         0.63
y           0.63
ĩ           0.63
=-=-=-=-    0.62
Activations Density 0.042%
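
An activation density like the one above is usually read as the share of sampled tokens on which the feature fires at all. The sketch below shows that arithmetic under that assumption. The function name `activation_density`, the zero threshold, and the sample of 100,000 tokens are hypothetical and are not taken from this page.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of tokens whose activation exceeds `threshold`.

    Assumption: "density" means (active tokens) / (all sampled tokens).
    """
    total = 0
    active = 0
    for value in activations:
        total += 1
        if value > threshold:
            active += 1
    if total == 0:
        raise ValueError("need at least one activation value")
    return active / total


# Hypothetical sample: 42 active tokens out of 100,000.
sample = [1.5] * 42 + [0.0] * 99_958
density = activation_density(sample)
print(f"{density:.3%}")  # 0.042%
```

Under this reading, a density of 0.042% corresponds to roughly 4 active tokens per 10,000 sampled, which marks the feature as sparse.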