Negative Logits
Powered       -0.07
ubble         -0.07
béné          -0.07
pretrained    -0.07
ijan          -0.07
pag           -0.07
gt            -0.07
usap          -0.07
pare          -0.07
pprint        -0.07
Positive Logits
disrespect    0.15
наруш         0.12
violates      0.10
jeopard       0.10
underm        0.10
disrupt       0.10
terhadap      0.10
disrupting    0.09
offend        0.09
нарушения     0.09
Activations Density: 0.029%