    Explanations

    mentions of physical violence or aggressive actions

    Negative Logits
    Token            Logit
    " coq"           -0.82
    " stockholm"     -0.82
    " purcha"        -0.80
    " budapest"      -0.78
    " increa"        -0.78
    " wien"          -0.76
    " lola"          -0.75
    " sii"           -0.74
    " fortn"         -0.74
    " alre"          -0.73

    Positive Logits
    Token            Logit
    "<bos>"           0.72
    " forehead"       0.52
    " twice"          0.52
    " face"           0.51
    " directo"        0.47
    " somewhere"      0.46
    " ***!"           0.45
    "ويد"             0.44
    " head"           0.44
    " shoulder"       0.43
    Activation Density: 0.159%

    No Known Activations