INDEX
    Explanations

    end of parenthetical phrase

    question endings and code additions

    Negative Logits
    "istical"  0.51
    "理论"  0.50
    (token not shown)  0.48
    (token not shown)  0.48
    "果た"  0.47
    (token not shown)  0.47
    "发挥"  0.46
    " "  0.46
    (token not shown)  0.46
    "顺利"  0.46
    Positive Logits
    "ק"  0.64
    " Фурга"  0.63
    " сдела"  0.57
    "די"  0.56
    "люми"  0.55
    "dL"  0.55
    "сона"  0.55
    "dS"  0.54
    "ной"  0.53
    " Мини"  0.53
    Act Density 0.002%

    No Known Activations