INDEX
    Explanations

    Structured, instructional explanations and advice (guide-like, step-by-step, or "breakdown"-style content typical of assistant responses).

    Negative Logits
        Token               Value
        " Alloys"           0.40
        " slump"            0.39
        " სას"              0.38
        "пня"               0.38
        (no token shown)    0.37
        " condemn"          0.37
        " swelling"         0.37
        " tee"              0.37
        " cargas"           0.36
        " unatt"            0.36
    Positive Logits
        Token               Value
        " خدمات"            0.45
        " российских"       0.43
        " ಕಾಣ"              0.42
        " تتم"              0.42
        "נית"               0.42
        " 아마"             0.42
        "EDY"               0.42
        "dishwasher"        0.42
        "RUB"               0.41
        "AMENTO"            0.41
    Activation Density: 14.575%

    No Known Activations