INDEX
    Explanations

    words related to deception or misrepresentation

    terms related to misrepresentation and distortion of information

    Negative Logits
    Token             Logit
    "rises"           -0.77
    "force"           -0.72
    "achine"          -0.70
    "¯¯¯¯"            -0.68
    "ça"              -0.66
    "spot"            -0.65
    " hungry"         -0.65
    "fi"              -0.64
    "irez"            -0.63
    "===="            -0.63
    Positive Logits
    Token             Logit
    " misrepresent"   0.86
    " inaccur"        0.84
    " distortions"    0.82
    " distort"        0.80
    " distortion"     0.80
    " inaccurate"     0.78
    " falsely"        0.76
    " distorted"      0.74
    " omission"       0.73
    " perceptions"    0.71
    Activation Density: 0.062%

    No Known Activations