INDEX
    Explanations
    Negative Logits
    Token             Logit
    " dishonest"      -0.06
    "رش"              -0.06
    " bestimm"        -0.06
    "ibility"         -0.06
    " betrayed"       -0.06
    " skilled"        -0.06
    "450"             -0.06
    " rpt"            -0.06
    " defense"        -0.06
    " temperature"    -0.06
    Positive Logits
    Token             Logit
    ".main"           0.07
    ")}</"            0.07
    "/Sub"            0.07
    " ním"            0.07
    " Bloody"         0.07
    "<I"              0.06
    " мик"            0.06
    " mpi"            0.06
    " Πολι"           0.06
    "]];↵↵"           0.06
    Activation Density: 0.081%

    No Known Activations