    Explanations
    No Explanations Found
    Negative Logits
     marav          0.48
    强大            0.43
     supremo        0.43
     Faced          0.43
     slaughtered    0.42
    豪華            0.42
     понадоби       0.41
     spared         0.40
     :)             0.40
     ಚೆ             0.40
    Positive Logits
     harmful        1.73
     unacceptable   1.55
     problematic    1.52
     distressing    1.51
     disturbing     1.50
     troubling      1.50
     damaging       1.44
     detrimental    1.44
     unhealthy      1.34
     unsettling     1.33
    Activation Density: 0.965%

    No Known Activations