INDEX
    Explanations

    indications of concern or discussions about safety-related issues

    followed by personal pronouns

    expressing uncertainty or preference

    Negative Logits
    " poichè"     -0.82
    " ainfi"      -0.80
    " !)"         -0.75
    (token missing in source)  -0.75
    " feroit"     -0.73
    " serupa"     -0.73
    " آنان"       -0.72
    " således"    -0.72
    "已是"        -0.70
    "几人"        -0.70
    Positive Logits
    " somebody"   1.20
    " everybody"  1.12
    " really"     1.11
    " maybe"      1.05
    "somebody"    1.01
    " anybody"    1.00
    " sort"       0.99
    " ["          0.97
    " basically"  0.97
    " kind"       0.95
    Act Density 0.320%
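
The density figure above is not defined on this page. Assuming it denotes the percentage of sampled tokens on which the feature has a nonzero activation, a minimal sketch of that calculation could look like the following. The function name `activation_density`, the `threshold` parameter, and the sample data are hypothetical and chosen only for illustration.

```python
def activation_density(activations, threshold=0.0):
    """Return the percentage of tokens whose activation exceeds `threshold`.

    Assumption: "density" means the share of tokens on which the feature
    fires at all; the dashboard itself does not state its definition.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return 100.0 * active / len(activations)


# Hypothetical sample: 2500 tokens, 8 of which activate the feature.
sample = [0.0] * 2492 + [1.5] * 8
print(f"{activation_density(sample):.3f}%")
```

Under this reading, a density of 0.320% would correspond to the feature firing on roughly 8 of every 2500 tokens.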

    No Known Activations