INDEX
    NEGATIVE LOGITS
    Token       Logit
                -0.07
    _;↵↵        -0.07
    PLIC        -0.07
    (decoded    -0.07
    imir        -0.06
     cooked     -0.06
     chased     -0.06
    ]*          -0.06
     мом        -0.06
    !).↵↵       -0.06

    POSITIVE LOGITS
    Token       Logit
    werp        0.08
     lệ         0.07
    怀疑        0.07
    owing       0.07
    onia        0.07
     extremely  0.07
    press       0.06
    ק           0.06
                0.06
                0.06
    Activation Density: 0.011%

    No Known Activations