INDEX
Explanations
phrases related to behavior and misconduct
Negative Logits
Solution    -0.76
Puzzles     -0.74
Arri        -0.73
cells       -0.73
Cells       -0.71
reader      -0.69
Qiao        -0.68
houses      -0.66
vae         -0.66
ropolis     -0.66
Positive Logits
unlawful        1.07
contrary        1.03
morally         1.01
unethical       1.01
lawful          1.00
inappropriate   0.99
violate         0.97
justified       0.95
eth             0.93
repre           0.92
Activations Density: 0.373%