    Negative Logits
    Token                   Value
    " née"                  1.10
    [token not captured]    0.98
    " veneer"               0.98
    " trabajó"              0.97
    "iftoire"               0.97
    " Corollary"            0.97
    " dilakukan"            0.97
    " técnicos"             0.97
    " manuals"              0.94
    " asymmetry"            0.94

    Positive Logits
    Token                   Value
    "D"                     0.88
    "K"                     0.87
    "USERNAME"              0.83
    "notin"                 0.83
    "username"              0.82
    "g"                     0.80
    "Lazy"                  0.77
    "sympy"                 0.77
    "T"                     0.76
    "Microsoft"             0.74
    Activation Density: 0.069%

    No Known Activations