INDEX
Explanations
instances of dishonesty or unethical behavior in various contexts
Negative Logits
Token     Logit
rome      -0.16
502       -0.15
alon      -0.15
odef      -0.15
вит       -0.14
ayan      -0.14
linger    -0.14
rottle    -0.14
сл        -0.14
ī´        -0.14
Positive Logits
Token         Logit
rig           0.38
Rig           0.36
rigged        0.35
manipulation  0.34
tam           0.32
rig           0.31
manipulated   0.31
rigs          0.30
manipulating  0.29
manip         0.29
Activations Density 0.065%