    NEGATIVE LOGITS
    Token    Logit
    "H"      1.38
    "I"      1.31
    "O"      1.23
    "A"      1.19
             1.18
    "W"      1.17
    "ل"      1.15
    "i"      1.08
    "T"      1.08
    "F"      1.08
    POSITIVE LOGITS
    Token        Logit
    " can"       1.05
    " "          0.96
    " are"       0.95
    "."          0.88
    " château"   0.88
    "yang"       0.86
    " müssen"    0.85
    "];"         0.84
    "𝑐"          0.84
    " be"        0.84
    Activation Density: 0.000%

    No Known Activations