    Negative Logits
    Token            Logit
    " collusion"     -0.07
    " kindness"      -0.07
    " meta"          -0.07
    " ctxt"          -0.06
    " числі"         -0.06
    "imensional"     -0.06
    "El"             -0.06
    " üzerine"       -0.06
    " über"          -0.06
    "eted"           -0.06
    Positive Logits
    Token            Logit
    " Justice"       0.19
    "Justice"        0.16
    " justice"       0.13
    "justice"        0.08
    "darwin"         0.07
    (no token shown) 0.07
    "distinct"       0.07
    " healthcare"    0.06
    " профес"        0.06
    "})↵↵"           0.06
    Activation Density: 0.005%

    No Known Activations