INDEX
    Explanations

    arguments related to morality and hypocrisy

    Negative Logits
    "olio"      -0.17
    "oho"       -0.17
    "als"       -0.16
    "icz"       -0.15
    "dan"       -0.15
    "inte"      -0.15
    "essen"     -0.15
    "inho"      -0.15
    "ounty"     -0.15
    "unch"      -0.14

    Positive Logits
    " rather"    0.34
    " nor"       0.33
    " instead"   0.32
    " merely"    0.32
    "nor"        0.31
    " Rather"    0.30
    "rather"     0.30
    "Nor"        0.29
    "Rather"     0.29
    "Instead"    0.28
    Act Density 0.252%

    No Known Activations