INDEX
    Explanations

    phrases indicating fairness or honesty

    phrases emphasizing fairness, honesty, and clarity in discussions

    Negative Logits
     surf        -0.73
     med         -0.64
    edu          -0.62
    Build        -0.61
    ãĥĺ          -0.60
     seams       -0.59
    ãĥIJ         -0.58
     Written     -0.58
     satur       -0.57
    bern         -0.57

    Positive Logits
    ensional     0.75
     Opinion     0.75
    idge         0.72
    ohn          0.72
    oops         0.70
     Ans         0.69
    ayson        0.69
    ESCO         0.69
     Obj         0.68
     Philippe    0.67

    Activation Density: 0.080%

    No Known Activations