INDEX
    Explanations

    phrases related to justice, morality, and identity

    expressions of duality or conflicting identities

    Negative Logits
     theirs         -0.48
    Availability    -0.48
    ).[             -0.47
    .).             -0.47
     nevertheless   -0.45
     nonetheless    -0.44
    +.              -0.44
     Ves            -0.43
     eventual       -0.42
    aults           -0.42
    Positive Logits
    ':              0.63
    ?'              0.62
    \":             0.56
    \",             0.50
    !'              0.49
     ',             0.47
    %"              0.47
     Replay         0.47
     ['             0.46
    '?              0.46
    Activation Density: 3.310%

    No Known Activations