INDEX
    Explanations

    refusing harmful content

    Negative Logits
    Token         Value
    " Paf"        0.74
    "かね"        0.72
    " нас"        0.69
    " ktor"       0.69
    "视角"        0.69
    " başına"     0.68
    " מאוד"       0.66
    " 	i"        0.64
                  0.64
    " гораздо"    0.63
    Positive Logits
    Token         Value
    " my"         1.19
    "my"          1.09
    " My"         1.06
    "My"          1.03
    " attempt"    0.98
    "MY"          0.91
    " MY"         0.91
    " such"       0.90
    " мои"        0.87
    " Such"       0.87
    Act Density 0.189%

    No Known Activations