    Explanations

    AI assistant refusing harmful requests

    Negative Logits
     flax            0.46
     cacao           0.44
    合作             0.39
    আনু              0.38
    elv              0.38
                     0.38
     sacar           0.38
    elina            0.37
     dast            0.37
     levi            0.37

    Positive Logits
                     0.41
    getReference     0.36
     అంత             0.35
    Insert           0.34
     விளங்க          0.34
    गौर              0.34
                     0.33
    のと             0.33
     വഹ              0.33
                     0.33
    Act Density 0.010%

    No Known Activations