    NEGATIVE LOGITS
    Token    Logit
    ar       2.31
    ра       2.19
    ق        1.97
    ной      1.91
    ab       1.88
    est      1.88
    ahah     1.86
    ل        1.85
    one      1.84
    ır       1.80
    POSITIVE LOGITS
    Token    Logit
    ک        2.00
    𝘭        1.96
    𝘣        1.89
    𝘪        1.85
    𝘳        1.85
    ని       1.80
    𝘥        1.80
    𝘱        1.77
    客様     1.76
    ce       1.71
    Activation Density: 0.018%

    No Known Activations