INDEX
    Explanations

    words related to something being untrustworthy or unreliable

    terms related to trust issues or untrustworthiness

    Negative Logits
     gorilla
    -0.74
    hyde
    -0.67
    anwhile
    -0.64
    姫
    -0.64
    SHIP
    -0.62
    Reviewer
    -0.62
     Mercury
    -0.62
     vans
    -0.60
     coefficient
    -0.60
     stages
    -0.60
    Positive Logits
    itled
    1.43
    rained
    1.42
    rans
    1.39
    apped
    1.31
    ested
    1.30
    rust
    1.26
    ruly
    1.25
    arget
    1.24
    race
    1.23
    oward
    1.23
    Act Density 0.018%

    No Known Activations