INDEX
Explanations
instances of dishonest behavior, specifically cheating
terms related to dishonest behavior and cheating
Negative Logits
ŃĶ       -0.76
area     -0.71
escal    -0.70
areth    -0.69
oran     -0.67
Vert     -0.67
entin    -0.63
rez      -0.62
vig      -0.62
eric     -0.61
Positive Logits
cheating      0.88
cheat         0.81
cheated       0.80
ulence        0.78
loophole      0.72
raud          0.71
ulent         0.70
exploited     0.68
loopholes     0.67
wrongdoing    0.65
Activations Density: 0.111%