    Explanations

    Words or phrases that indicate instances of deception or betrayal.

    Negative Logits
    Token          Logit
    "362"          -0.15
    " Strauss"     -0.14
    "274"          -0.14
    " Vend"        -0.14
    "로드"         -0.14
    "enth"         -0.14
    "eba"          -0.14
    " minority"    -0.13
    "illis"        -0.13
    "174"          -0.13
    Positive Logits
    Token          Logit
    " sh"           0.25
    "enan"          0.17
    " INCIDENT"     0.16
    "alink"         0.16
    "ETHER"         0.16
    ".sh"           0.16
    "vier"          0.15
    "abby"          0.15
    "sh"            0.15
    " ort"          0.15
    Activation Density: 0.024%

    No Known Activations