INDEX
    Explanations

    Extract workshops Q 1 post

    Negative Logits
        Token          Logit
        " juſt"        -1.07
        " كلام"        -0.94
        " nk"          -0.92
        " durant"      -0.91
        "aporation"    -0.91
        " joyful"      -0.90
        " grumpy"      -0.89
        "tember"       -0.88
        " לאחר"        -0.85
        " jakość"      -0.85
    Positive Logits
        Token          Logit
        " before"       0.94
        " both"         0.86
        " invid"        0.84
        "hinh"          0.83
        "bati"          0.82
        " well"         0.79
        "dirond"        0.78
        " Before"       0.78
        "}^{*}\"        0.77
        " at"           0.76
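Top-logit lists like the two above are commonly produced by projecting the feature's decoder direction through the model's unembedding matrix and keeping the most negative and most positive entries. The sketch below assumes exactly that and ignores the final layer norm and any other scaling a real pipeline may apply. `logit_effects`, `top_logits`, and the toy vocabulary and weights are hypothetical, not taken from this page.

```python
from typing import List, Sequence, Tuple

# Hypothetical helpers. Assumption: logit effect = decoder direction @ W_U,
# with no layer norm or other scaling applied.


def logit_effects(direction: Sequence[float], w_u: Sequence[Sequence[float]]) -> List[float]:
    """Score every vocabulary entry; w_u has d_model rows and vocab_size columns."""
    vocab_size = len(w_u[0])
    return [
        sum(direction[i] * w_u[i][j] for i in range(len(direction)))
        for j in range(vocab_size)
    ]


def top_logits(
    direction: Sequence[float],
    w_u: Sequence[Sequence[float]],
    vocab: Sequence[str],
    k: int = 10,
) -> Tuple[List[Tuple[str, float]], List[Tuple[str, float]]]:
    """Return the k most negative and the k most positive (token, logit) pairs."""
    scored = sorted(zip(vocab, logit_effects(direction, w_u)), key=lambda pair: pair[1])
    negative = scored[:k]          # ascending order: most negative first
    positive = scored[::-1][:k]    # descending order: most positive first
    return negative, positive


# Toy data, made up for illustration: d_model = 2, vocab_size = 4.
vocab = [" before", " both", " joyful", " grumpy"]
direction = [1.0, -0.5]
w_u = [
    [0.6, 0.4, -0.3, -0.2],
    [-0.6, -0.8, 0.9, 0.7],
]

negative, positive = top_logits(direction, w_u, vocab, k=2)
print("negative:", negative)
print("positive:", positive)
```

With a real model, `direction` would be one row of the SAE decoder and `w_u` the model's unembedding matrix; a vectorized array library would replace the nested loops.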
    Activation density: 0.005%
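Activation density is usually read as the percentage of sampled tokens on which the feature fires (activation above zero). Under that assumption, 0.005% corresponds to roughly one token in 20,000. The `activation_density` helper and its zero threshold below are a hypothetical sketch of the calculation, not the dashboard's actual code.

```python
from typing import Sequence


def activation_density(activations: Sequence[float], threshold: float = 0.0) -> float:
    """Percentage of tokens whose activation exceeds `threshold` (assumed to be 0)."""
    if not activations:
        raise ValueError("need at least one activation")
    firing = sum(1 for a in activations if a > threshold)
    return 100.0 * firing / len(activations)


# Made-up sample: one firing token out of 20,000.
sample = [0.0] * 19_999 + [1.3]
print(f"{activation_density(sample):.3f}%")  # prints 0.005%
```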

    No Known Activations