    Negative Logits
    Token             Value
    "was"             0.58
    ":"               0.55
    " designed"       0.53
    " self"           0.51
    " was"            0.51
    " prohibitions"   0.51
    "вання"           0.50
    " sib"            0.50
    ").""             0.50
    " desires"        0.50
    Positive Logits
    Token             Value
    (blank)           0.53
    "Пре"             0.52
    "广播"            0.52
    "Боль"            0.52
    (blank)           0.51
    "音乐"            0.50
    "𝙵"               0.50
    "也就是"          0.49
    "َن"              0.49
    "Ка"              0.49
    Activation Density: 0.002%

    No Known Activations