INDEX
    Explanations

    statements or actions related to decisions made in various situations

    Negative Logits
    Token          Logit
    "vae"          -0.74
    "english"      -0.69
    " havoc"       -0.68
    "amen"         -0.67
    "icas"         -0.67
    "outh"         -0.67
    "ighth"        -0.65
    "uana"         -0.64
    "anti"         -0.64
    "eco"          -0.64
    Positive Logits
    Token          Logit
    " makers"      1.04
    "jar"          0.92
    " maker"       0.89
    "making"       0.83
    " decision"    0.83
    " decisions"   0.81
    "maker"        0.79
    " ACTIONS"     0.76
    "makers"       0.75
    "lessness"     0.69
    Activation Density: 0.037%

    No Known Activations