INDEX
Negative Logits
Token         Logit
ULK           -0.06
_CATEGORY     -0.06
ับสน          -0.06
825           -0.06
(manager      -0.06
english       -0.06
assortment    -0.06
김            -0.06
English       -0.06
Gus           -0.06
Positive Logits
Token         Logit
violate       0.13
violated      0.12
violates      0.11
violations    0.11
violating     0.11
violation     0.11
违            0.08
_partial      0.08
υγ            0.08
viol          0.08
Activations Density 0.013%