INDEX
Explanations
refusing harmful user requests
Negative Logits
souci       0.79
咱          0.79
Approach    0.78
んだけど    0.75
rethinking  0.74
leadsto     0.74
ਖ           0.74
Approach    0.72
機會        0.71
dyspe       0.70
Positive Logits
functionalities  0.99
ideologies       0.90
ultimatum        0.89
morals           0.88
advancements     0.83
parameters       0.83
persona          0.82
ideology         0.82
premise          0.81
skillset         0.81
Activations Density 0.179%