INDEX
    Explanations

    phrases related to explanation or reasoning

    phrases that indicate attribution or parts of a whole

    Negative Logits
    soever          -0.85
    rams            -0.79
    ancies          -0.75
    ittal           -0.75
     teasp          -0.74
    の魔            -0.72
    estyles         -0.70
    oons            -0.69
    ゼウス          -0.68
    erion           -0.66
    Positive Logits
     why            1.22
     what           0.92
     reason         0.89
     explaining     0.83
    why             0.79
     being          0.79
     WHY            0.78
     me             0.77
     understanding  0.76
     overcoming     0.75
    Act Density 0.076%

    No Known Activations