INDEX
Negative Logits
Fraud       -0.09
fak         -0.09
Bere        -0.08
_traits     -0.08
fraud       -0.08
Synd        -0.08
æ¬          -0.08
intr        -0.08
درجÙĩ       -0.08
èŃ          -0.08
Positive Logits
original    0.17
hate        0.15
statement   0.13
speech      0.13
original    0.13
(original   0.12
message     0.12
Hate        0.12
initial     0.12
argument    0.11
Activations Density: 0.058%