INDEX
Explanations
words related to unethical or illegal behavior, specifically misconduct
instances of the word "misconduct" and related phrases
Negative Logits
LM          -0.76
hered       -0.73
Sabha       -0.69
Lear        -0.68
ebus        -0.68
zan         -0.66
izen        -0.64
Collider    -0.63
oos         -0.63
Juliet      -0.63
Positive Logits
owship      0.99
discharge   0.76
utes        0.73
onduct      0.71
uracy       0.69
misconduct  0.69
aunders     0.68
orem        0.66
eatures     0.66
misc        0.66
Activations Density: 0.037%