NEGATIVE LOGITS
uncover         -0.07
اختیار          -0.07
soft            -0.07
gorithms        -0.07
eligibility     -0.07
_pieces         -0.06
reservations    -0.06
.,              -0.06
availability    -0.06
.family         -0.06
POSITIVE LOGITS
insulting       0.12
insults         0.12
insult          0.11
                0.07
.Disclaimer     0.06
abusive         0.06
.Interfaces     0.06
humiliation     0.06
thanking        0.06
rebut           0.06
Activations Density 0.006%