INDEX
    Negative Logits
    Token                 Logit
    "s"                   0.93
    "I"                   0.89
    "1"                   0.82
    "asing"               0.80
    " a"                  0.79
    "\"                   0.76
    (no visible token)    0.76
    "houses"              0.76
    "おい"                0.74
    "alk"                 0.74
    Positive Logits
    Token                 Logit
    " be"                 1.05
    "ת"                   1.04
    "وية"                 1.02
    "ي"                   1.02
    "ul"                  0.98
    "ко"                  0.94
    (no visible token)    0.93
    "ibilities"           0.92
    "ות"                  0.91
    "۔"                   0.89
    Act Density 0.005%

    No Known Activations