INDEX
Explanations
references to violence and events related to the LGBT community
Negative Logits
Token        Logit
ysz          -0.18
Malay        -0.17
Malaysian    -0.15
Malaysia     -0.15
корол        -0.14
Boone        -0.14
ź            -0.14
lew          -0.14
Indonesian   -0.14
Mohammed     -0.13
Positive Logits
Token        Logit
Georgia      0.40
Georgian     0.39
Georgia      0.36
груз         0.29
áĥ           0.27
Kak          0.27
Bat          0.25
Georg        0.24
Caucas       0.24
Caucasian    0.24
Activations Density 0.013%
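
Some token strings in logit lists like these look garbled (for example `áĥ`) because they are shown in a byte-level BPE display alphabet rather than as decoded text. The sketch below shows one way to map such strings back to bytes and decode them where possible. It assumes a GPT-2 style byte-to-character table, which the dashboard itself does not confirm; `display_to_bytes` and `repair` are hypothetical helper names introduced only for this illustration.

```python
def bytes_to_unicode() -> dict[int, str]:
    """Byte -> display character table, as in GPT-2 style byte-level BPE (assumed)."""
    # Printable bytes are displayed as themselves.
    keep = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(0xA1, 0xAC + 1))
        + list(range(0xAE, 0xFF + 1))
    )
    table = {b: chr(b) for b in keep}
    # Every other byte is shifted to an unused code point starting at U+0100.
    shift = 0
    for b in range(256):
        if b not in table:
            table[b] = chr(256 + shift)
            shift += 1
    return table


DISPLAY_TO_BYTE = {ch: b for b, ch in bytes_to_unicode().items()}


def display_to_bytes(token: str) -> bytes:
    """Convert a displayed token to raw bytes (hypothetical helper)."""
    out = bytearray()
    for ch in token:
        if ch in DISPLAY_TO_BYTE:
            out.append(DISPLAY_TO_BYTE[ch])
        else:
            # Characters outside the display alphabet are treated as already
            # decoded text. This is a heuristic: a genuine Latin-1 letter such
            # as "é" would be misread as a single raw byte.
            out.extend(ch.encode("utf-8"))
    return bytes(out)


def repair(token: str) -> str:
    """Return decoded text, or the token unchanged if its bytes are not valid UTF-8."""
    raw = display_to_bytes(token)
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Partial multi-byte sequences cannot be decoded on their own.
        return token


if __name__ == "__main__":
    for tok in ["коÑĢол", "гÑĢÑĥз", "áĥ", "Georgia"]:
        print(repr(tok), "->", repr(repair(tok)), display_to_bytes(tok).hex(" "))
```

Under this assumption, `áĥ` maps to the bytes `e1 83`, an incomplete UTF-8 sequence, so it is left unchanged. Those two bytes are the common prefix of most characters in the Georgian script block, which would fit the other top positive tokens.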