    Explanations

    refusing harmful or unethical requests

    Negative Logits
    " should"      0.46
    "应该"          0.45
    " hopefully"   0.45
    " dovrebbe"    0.45
    " conviene"    0.44
    " باید"        0.43
    "তবে"          0.42
    "應該"          0.42
    " può"         0.42
    " manchmal"    0.42
    Positive Logits
    " request"        0.80
    " 언급"            0.75
    "request"         0.72
    " requesting"     0.72
    " описание"       0.72
    " descriptions"   0.70
    " description"    0.69
    " richiesta"      0.68
    " запрос"         0.68
    " requested"      0.67
    Activation Density: 0.063%

    No Known Activations