INDEX
Explanations
ethical and harmless principles
NEGATIVE LOGITS
大胆           0.46
hydrophobic    0.46
runtime        0.43
ู่              0.42
nightlife      0.40
bold           0.40
Runtime        0.40
bold           0.39
夜             0.39
Bold           0.39
POSITIVE LOGITS
moral          0.99
altru          0.95
wholesome      0.93
Moral          0.90
善良           0.89
Moral          0.89
virtuous       0.89
moral          0.85
morals         0.83
ethical        0.82
Activations Density 0.392%