INDEX
    Explanations

    words related to news events and possibly controversial political statements

    Negative Logits
    </h2>       -0.51
    </h3>       -0.49
    </strong>   -0.48
    OW          -0.47
    ↵↵          -0.47
    }{          -0.47
     .          -0.47
    -           -0.46
    0           -0.46
                -0.46
    Positive Logits
     thut       1.15
     Souha      1.13
     fta        1.12
     Juf        1.11
     aen        1.10
     Khart      1.08
     fortn      1.07
     dises      1.07
     Adieu      1.07
     squa       1.05
    Activation Density: 0.115%

    No Known Activations