    Explanations

    phrases that involve justifying actions or making excuses

    Negative Logits
    Token         Logit
    "eldon"       -0.14
    "VO"          -0.14
    "AFX"         -0.14
    " rodin"      -0.14
    "KeyValue"    -0.14
    "고"          -0.14
    "먹"          -0.13
    "ạt"          -0.13
    "andest"      -0.13
    "itivity"     -0.13
    Positive Logits
    Token               Logit
    " why"              0.29
    " justify"          0.23
    "why"               0.23
    "为什么"            0.21
    " Why"              0.20
    " justification"    0.20
    "justify"           0.19
    "Why"               0.18
    " reasons"          0.18
    " Reasons"          0.17
    Activation Density: 0.157%

    No Known Activations