    Explanations

    sentences that provide explanations or reasons for a given situation or action

    Negative Logits
    Token       Logit
    "nces"      -0.70
    " torch"    -0.70
    "readable"  -0.69
    "kun"       -0.68
    "imet"      -0.65
    "uania"     -0.65
    "borg"      -0.64
    "ona"       -0.63
    "adiq"      -0.63
    "istered"   -0.62
    Positive Logits
    Token       Logit
    "Because"   1.22
    "Reason"    1.20
    " Because"  1.19
    " reasons"  1.13
    "Cause"     1.13
    "cause"     1.11
    " Reasons"  1.11
    "ecause"    1.07
    " WHY"      1.00
    " because"  0.97
    Activation Density: 0.169%

    No Known Activations