Explanations
derogatory or obscene language
Negative Logits
Token     Logit
sou       -0.17
ists      -0.16
ign       -0.16
borg      -0.14
onder     -0.14
imers     -0.14
argins    -0.14
803       -0.14
iom       -0.14
zet       -0.14
Positive Logits
Token     Logit
assic     0.15
IFO       0.15
anale     0.14
ãĦ        0.14
rgan      0.14
nop       0.14
gis       0.14
PTS       0.14
aukee     0.13
anske     0.13
Activations Density: 0.060%