INDEX
    Explanations

    strength, adaptability, preparedness

    language associated with AI safety-policy refusals and content moderation, flagging explanations of why a request is harmful or disallowed and redirections to safer alternatives or support resources.

    Negative Logits
    0.47
    0.43
    0.42
     "../../
    0.42
     والإ
    0.39
     regain
    0.39
     symplect
    0.39
    交易所
    0.39
     rehabilitate
    0.38
     rehabilitation
    0.38
    Positive Logits
    INCRE
    0.50
    но
    0.45
    фект
    0.42
    orse
    0.42
    iny
    0.41
    jego
    0.41
    increased
    0.41
    новых
    0.40
    increases
    0.39
     космо
    0.39
    Act Density 10.297%

    No Known Activations