Explanations
strength, adaptability, preparedness
Language associated with AI safety-policy refusals and content moderation: explanations of why a request is harmful or disallowed, and redirections to safer alternatives or support resources.
Negative Logits
ᔨ    0.47
ゝ    0.43
棸    0.42
"../../    0.42
والإ    0.39
regain    0.39
symplect    0.39
交易所    0.39
rehabilitate    0.38
rehabilitation    0.38
Positive Logits
INCRE    0.50
но    0.45
фект    0.42
orse    0.42
iny    0.41
jego    0.41
increased    0.41
новых    0.40
increases    0.39
космо    0.39
Activations Density 10.297%