    Explanations

    Assistant (model-turn) content, especially structured advisory responses with headings, bullet points, and safety or disclaimer framing.

    Negative Logits
    " Vorstand"  0.44
    "もあります"  0.42
    (token missing in source)  0.42
    "ißler"  0.41
    (token missing in source)  0.41
    "Unifier"  0.40
    " شاء"  0.40
    "िन"  0.40
    "শ্ব"  0.40
    (token missing in source)  0.40
    Positive Logits
    " silenced"  0.43
    " faz"  0.42
    " fazia"  0.41
    " simplement"  0.41
    " momentos"  0.39
    " amea"  0.39
    " recuerdos"  0.39
    " polarity"  0.38
    " motionless"  0.38
    " deje"  0.38
    Activation density: 10.779%

    No Known Activations