INDEX
    Explanations

    phrases indicating justification or lack of justification

    instances of the word "reason" and its variations, indicating justifications or rationales

    Negative Logits
    Token           Logit
    "chin"          -0.68
    "chron"         -0.63
    " Carbuncle"    -0.61
    "xon"           -0.60
    "tein"          -0.59
    "ModLoader"     -0.57
    "ilation"       -0.57
    "ophon"         -0.56
    " Warcraft"     -0.55
    "rodu"          -0.54
    Positive Logits
    Token               Logit
    " why"              1.51
    "why"               1.32
    " WHY"              1.23
    "abl"               1.06
    " Why"              0.99
    "Why"               0.96
    " justifying"       0.81
    " rationale"        0.80
    " justification"    0.78
    "Reviewer"          0.74
    Activation Density: 0.044%

    No Known Activations