INDEX
    Explanations

    phrases related to explanations or clarity of concepts and actions

    Head Attr Weights
    0:0.02
    1:0.02
    2:0.05
    3:0.08
    4:0.10
    5:0.02
    6:0.04
    7:0.45
    8:0.02
    9:0.02
    10:0.07
    11:0.06
    Negative Logits
    "elight"           -2.08
                       -1.73
    "emouth"           -1.62
    "erity"            -1.62
    "mouth"            -1.54
    "cedented"         -1.50
    "ibaba"            -1.50
    " pilgr"           -1.44
    "ngth"             -1.43
    "rive"             -1.41
    Positive Logits
    "why"               2.09
    " WHY"              2.08
    " why"              2.07
    " aloud"            1.93
    " gist"             1.90
    "actionDate"        1.84
    " intric"           1.81
    " convoluted"       1.81
    " complicated"      1.79
    " misunderstand"    1.79
    Act Density 0.054%

    No Known Activations