Explanations
instances of fraudulent and deceptive behavior
terms related to fraudulent and deceptive practices
Negative Logits
arium     -0.77
nat       -0.75
hung      -0.74
area      -0.73
mun       -0.73
bur       -0.72
alist     -0.72
hed       -0.72
resent    -0.71
raq       -0.70
Positive Logits
fraudulent      0.88
scam            0.85
unsuspecting    0.85
fraud           0.84
scams           0.82
dece            0.82
deceive         0.78
manipulative    0.77
deception       0.76
cheat           0.75
Activations Density 0.025%
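
The page does not define the density figure. A common reading, assumed here, is the fraction of tokens on which the feature's activation is nonzero, so 0.025% would be roughly 1 token in 4,000. The sketch below shows that calculation on made-up data; the function name `activation_density`, the `threshold` parameter, and the sample values are hypothetical and not taken from the dashboard.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of tokens whose activation exceeds `threshold`.

    Assumption: "density" means the share of tokens on which the feature fires.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: 1 active token out of 4000.
sample = [0.0] * 3999 + [1.7]
density = activation_density(sample)
print(f"{density:.3%}")
```

With these made-up activations the computed fraction is 1/4000, the same order as the density reported above.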