INDEX
    Explanations

    refusal of harmful requests

    Negative Logits
     harbor             0.47
     harbors            0.44
     flavor             0.44
     flavors            0.41
     multicolored       0.40
     Gravel             0.40
     favorably          0.39
     Sliver             0.39
     Harbor             0.39
    nessy               0.38
    Positive Logits
     nggak              0.45
     TikTok             0.45
     चीज़               0.45
    TikTok              0.44
     netizens           0.44
                        0.43
     एक्सरसा            0.43
     Pics               0.43
     personalisation    0.43
     آ                  0.42
    Act Density 0.009%

    No Known Activations