    Explanations

    pose a threat/risk/violation

    Negative Logits
    (token not shown)    0.62
    (token not shown)    0.58
    "高峰"    0.57
    " 주의"    0.56
    " cautious"    0.56
    " ow"    0.56
    "افظ"    0.56
    "ah"    0.55
    "ాలంటే"    0.55
    " Junction"    0.54
    Positive Logits
    " teeth"    0.80
    " existential"    0.69
    " tooth"    0.69
    " Trojan"    0.68
    "ɬ"    0.67
    " Transform"    0.67
    " Teeth"    0.66
    " attack"    0.66
    " dientes"    0.66
    " camada"    0.65
    Activation density: 0.147%

    No Known Activations