Explanations
derogatory or insulting terms used to describe people
Negative Logits
spea                -0.47
])->                -0.45
väl                 -0.45
wię                 -0.45
英語版              -0.44
}=\{                -0.43
Produzione          -0.42
SequentialGroup     -0.42
BeginInit           -0.42
δί                  -0.42
Positive Logits
bastard             1.04
idiot               0.94
bastards            0.94
scum                0.93
Bastard             0.92
idiots              0.90
morons              0.88
moron               0.87
asshole             0.86
umbag               0.86
Activations Density 0.225%
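
As a rough illustration of how figures like the ones in this listing could be produced, the sketch below ranks tokens by logit value and computes an activation density. It assumes that density means the fraction of sampled tokens on which the feature's activation is above zero; the function names, the threshold, and the sample activations are hypothetical and are not taken from the tool that generated this page.

```python
def top_logits(token_logits, k=10):
    """Return (k most negative, k most positive) (token, logit) pairs.

    token_logits: dict mapping a token string to its logit value.
    Hypothetical helper, not the dashboard's own code.
    """
    ranked = sorted(token_logits.items(), key=lambda kv: kv[1])
    negative = ranked[:k]          # ascending, most negative first
    positive = ranked[-k:][::-1]   # descending, most positive first
    return negative, positive


def activation_density(activations, threshold=0.0):
    """Fraction of tokens whose activation exceeds threshold (assumed definition)."""
    if not activations:
        return 0.0
    fired = sum(1 for a in activations if a > threshold)
    return fired / len(activations)


# A few values copied from the listing above.
sample = {"bastard": 1.04, "idiot": 0.94, "scum": 0.93, "spea": -0.47, "väl": -0.45}
negative, positive = top_logits(sample, k=2)

# Made-up activations: the feature fires on 9 of 4000 tokens.
acts = [0.0] * 4000
for i in range(9):
    acts[i * 400] = 1.5
density = activation_density(acts)
print(f"Activations Density {density:.3%}")
```

With these made-up activations the feature fires on 9 of 4000 tokens, which matches the 0.225% shown above only because the sample was chosen to do so.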