INDEX
    Explanations
    Negative Logits
    ד            0.81
    бу           0.78
    一层           0.74
     aeroplane   0.72
    ва           0.69
    ות           0.68
     asos        0.68
    ދ            0.68
    بی           0.66
    ж            0.66
    Positive Logits
    ren          0.78
    er           0.71
    ory          0.66
    heart        0.66
    ox           0.65
    King         0.65
    her          0.64
    ong          0.64
    onder        0.63
    oms          0.62
    Activation Density: 0.012%

    No Known Activations