INDEX
    Explanations

    phrases related to rationalization and logic

    Head Attr Weights
    0:0.02
    1:0.02
    2:0.04
    3:0.08
    4:0.16
    5:0.04
    6:0.04
    7:0.31
    8:0.03
    9:0.07
    10:0.08
    11:0.07
    Negative Logits
    ibaba         -2.06
    auga          -1.64
    psey          -1.64
    phabet        -1.64
    keyes         -1.61
    title         -1.57
    ailability    -1.56
    ighth         -1.56
    bley          -1.55
    vals          -1.55
    Positive Logits
     guilt         1.86
     inaction      1.66
     selfish       1.59
     greed         1.54
     irrational    1.54
     thinking      1.52
     differently   1.48
     fears         1.47
     pleas         1.47
     endlessly     1.45
    Act Density 0.001%

    No Known Activations