INDEX
    Explanations

    concepts related to responsibility and accountability

    Negative Logits
    Token        Logit
    "lingen"     -0.15
    "lass"       -0.14
    "Ãło"        -0.14
    " Ãľst"      -0.14
    "okoj"       -0.13
    "меÑĩ"       -0.13
    "imers"      -0.13
    "lek"        -0.13
    "LC"         -0.13
    "-Origin"    -0.13
    Positive Logits
    Token        Logit
    " lies"      1.02
    " lie"       0.99
    " lying"     0.79
    " Lies"      0.76
    " Lie"       0.74
    " lay"       0.69
    "lie"        0.68
    "lies"       0.68
    " lied"      0.66
    "Lie"        0.66
    Activation Density: 0.366%

    No Known Activations