    Explanations

    instances of dishonesty or unethical behavior in various contexts

    Negative Logits
    rome
    -0.16
    502
    -0.15
    alon
    -0.15
    odef
    -0.15
    вит
    -0.14
    ayan
    -0.14
    linger
    -0.14
    rottle
    -0.14
    сл
    -0.14
    ī´
    -0.14
    Positive Logits
     rig
    0.38
     Rig
    0.36
     rigged
    0.35
     manipulation
    0.34
     tam
    0.32
    rig
    0.31
     manipulated
    0.31
     rigs
    0.30
     manipulating
    0.29
     manip
    0.29
    Act Density 0.065%

    No Known Activations