    Explanations

    helpful AI assistant refusals

    NEGATIVE LOGITS
    Token              Value
    " callSettings"    0.42
    " MaterialApp"     0.38
    " freshest"        0.38
    "乘以"             0.38
    " 말미암아"        0.38
    "的我"             0.38
    " morally"         0.36
    " births"          0.36
    " antérieur"       0.36
    " antigenic"       0.36
    POSITIVE LOGITS
    Token              Value
    " chatbot"         0.63
    "chatbot"          0.57
    " assistant"       0.52
    " AI"              0.49
    " chatbots"        0.49
    " guide"           0.48
    " безопас"         0.48
    "AI"               0.46
    " helpful"         0.46
    " innocuous"       0.46
    Activation Density: 0.021%

    No Known Activations