INDEX
Explanations
inappropriate misuse or illegality
Negative Logits
+               0.75
+               0.65
uitgebre        0.63
spotless        0.62
quietly         0.59
或              0.59
own             0.57
willing         0.57
occasionally    0.57
leisurely       0.57
Positive Logits
unlawful        1.28
inappropriate   1.26
unethical       1.25
inappropri      1.23
violates        1.21
harmful         1.17
illegal         1.16
inhum           1.13
misuse          1.11
illegal         1.11
Activations Density 2.445%