INDEX
    Explanations

    words associated with correctness or justification

    terms related to fairness and appropriateness in various contexts

    Negative Logits
    " clos"          -0.66
    "iments"         -0.66
    " Alam"          -0.65
    " dolls"         -0.65
    "ession"         -0.63
    " Volunteers"    -0.63
    " shirts"        -0.63
    "ema"            -0.62
    " fertility"     -0.60
    " Football"      -0.60
    Positive Logits
    "ãĤ©"            0.96
    " rightfully"    0.95
    " deserved"      0.92
    " rightly"       0.90
    "é¾į"            0.85
    "è¯"             0.78
    " outweigh"      0.77
    "ãĥ£"            0.75
    "æĺ¯"            0.74
    "eous"           0.73
    Activation Density: 0.013%

    No Known Activations