INDEX
    Explanations

    phrases related to providing explanations or reasons

    phrases that indicate explanations or justifications

    Negative Logits
    Token         Logit
    "ography"     -0.77
    "emies"       -0.76
    "ctors"       -0.74
    "nets"        -0.74
    "heit"        -0.71
    "Dialogue"    -0.71
    "jab"         -0.69
    "dayName"     -0.69
    "nown"        -0.68
    "ograp"       -0.67
    Positive Logits
    Token                 Logit
    " why"                1.53
    "why"                 1.13
    " WHY"                0.99
    " discrepancies"      0.97
    " variance"           0.91
    " reluctance"         0.89
    " inconsistencies"    0.84
    " discrep"            0.82
    " how"                0.80
    " disparities"        0.79
    Activation Density: 0.126%

    No Known Activations