INDEX
    Explanations

    open-weights model widely

    Negative Logits
    Token          Logit
    (missing)      0.37
    (missing)      0.36
    "vera"         0.35
    " hop"         0.34
    (missing)      0.33
    (missing)      0.33
    "headings"     0.33
    "bordered"     0.32
    " سادہ"        0.32
    "ula"          0.32
    Positive Logits
    Token            Logit
    " Marian"        0.40
    "PBS"            0.39
    "NAM"            0.36
    "CNN"            0.36
    " Embassy"       0.35
    " PBS"           0.35
    " Electrochem"   0.34
    "AMP"            0.34
    "celer"          0.34
    "Dipl"           0.34
    Activation Density: 0.015%

    No Known Activations