INDEX
Explanations
instances of manipulation and deception in societal contexts
Negative Logits
AssemblyProduct  -0.69
httphttps        -0.60
ніципалі         -0.60
addCriterion     -0.57
hyrchwyd         -0.57
lenker           -0.57
DotNetBar        -0.56
AssemblyTitle    -0.56
Personensuche    -0.55
RSpec            -0.55
Positive Logits
fooled        1.34
gul           1.24
deceived      1.18
unsuspecting  1.11
naive         1.06
fool          1.05
fools         1.05
tricked       1.05
foolish       1.04
fall          1.02
Activations Density: 0.242%