    Explanations

    medical or other professional advice

    Sentences or phrases where the model refuses harmful requests and provides safety guidance, resource links, and crisis/support information.

    Negative Logits
     இருந்தாலும்  0.45
     Trends  0.42
     অ্যান্ড্র  0.41
     Experience  0.41
     Library  0.40
     अक्सर  0.40
     varietà  0.40
     போலவே  0.39
    便利  0.39
     Crypt  0.38
    Positive Logits
     urgently  0.53
     murderous  0.53
     manifestly  0.50
    Neces  0.50
     IMMEDI  0.49
     endangering  0.48
     perpetrators  0.48
     absolutamente  0.48
     dringend  0.48
     urgente  0.47
    Activation Density: 0.388%

    No Known Activations