INDEX
Explanations
refusing to provide harmful instructions
Negative Logits
堑  0.40
stated  0.38
megfe  0.38
identificado  0.37
햇  0.37
Ireland  0.36
Desired  0.36
plaie  0.36
habló  0.36
தாம்  0.36
Positive Logits
Datas  0.41
Wikis  0.39
離  0.39
Metall  0.38
Civil  0.38
क़  0.38
logically  0.38
论文  0.37
离  0.37
λογ  0.37
Activations Density: 0.003%