INDEX
    Explanations

    statements related to disclosure or justification

    discussions centered around honesty and revelation

    Negative Logits
    Token        Logit
    "phalt"      -0.76
    "rians"      -0.70
    "atl"        -0.70
    "croft"      -0.68
    "agues"      -0.67
    "rian"       -0.66
    "erva"       -0.66
    "iatrics"    -0.65
    "enf"        -0.65
    "onder"      -0.64
    Positive Logits
    Token          Logit
    " his"         1.07
    " their"       1.03
    " herself"     0.93
    " her"         0.89
    " owning"      0.87
    " what"        0.86
    " how"         0.86
    "their"        0.86
    " himself"     0.85
    " quitting"    0.85
    Activation Density: 0.401%

    No Known Activations