    Negative Logits
    Token             Logit
    " TestBed"        -0.51
    "ContentAsync"    -0.48
    " kautta"         -0.46
    "ANDUM"           -0.45
    " volna"          -0.45
    " mío"            -0.44
    " szól"           -0.41
    "imeni"           -0.40
    "ckså"            -0.40
    " gärna"          -0.40
    Positive Logits
    Token             Logit
    "Safety"          0.77
    " Safety"         0.73
    " safety"         0.73
    " SAFETY"         0.70
    "SAFETY"          0.70
    "safety"          0.68
    "SAFE"            0.65
    " 安全"           0.60
    "Safe"            0.59
    " safe"           0.59
    Activation Density: 0.018%

    No Known Activations