    Negative Logits
    Token        Value
    6            1.02
    8            1.01
    1            0.98
    4            0.97
    7            0.97
    </strong>    0.96
    .            0.94
    </h5>        0.93
    </h3>        0.93
    that         0.84
    Positive Logits
    Token        Value
    ق            1.23
    is           1.15
    o            1.09
    ির           1.03
    د            0.99
    h            0.98
    in           0.98
    q            0.94
    ل            0.91
                 0.89
    Act Density 0.069%

    No Known Activations