INDEX
    Explanations

    slaughterhouse and slight

    Negative Logits
    м       1.24
    𝙙       1.24
    م       1.20
    𝙢       1.20
    в       1.19
    در      1.18
    𝙜       1.16
    𝙠       1.15
            1.12
    к       1.12
    Positive Logits
    ad      1.53
    5       1.45
    ח       1.43
    9       1.41
    ط       1.38
    ة       1.37
    4       1.35
    1       1.33
    ap      1.30
    יות     1.29
    Activation Density: 0.005%

    No Known Activations