INDEX
    Explanations
    NEGATIVE LOGITS
    "arness"      -0.07
    "ceiver"      -0.06
    "��"          -0.06
    " sideways"   -0.05
                  -0.05
    " nominate"   -0.05
    " empath"     -0.05
    "-contrib"    -0.05
    "ukarı"       -0.05
                  -0.05
    POSITIVE LOGITS
    "DI"      0.19
    "di"      0.16
    "edi"     0.15
    "EDI"     0.14
    "aldi"    0.13
    "andi"    0.13
    "udi"     0.12
    " edi"    0.11
    "ardi"    0.10
    "endi"    0.10
    Activation Density: 0.011%

    No Known Activations