INDEX
    Explanations

    phrases or questions that express expectations or propose actions

    Head Attr Weights
    0:0.05
    1:0.03
    2:0.12
    3:0.26
    4:0.07
    5:0.06
    6:0.02
    7:0.13
    8:0.07
    9:0.02
    10:0.07
    11:0.04
    Negative Logits
    -2.68
    ée
    -2.51
    poral
    -2.50
    ocular
    -2.45
    ixt
    -2.40
     Illum
    -2.29
    スト
    -2.29
    ��
    -2.28
    ixture
    -2.27
    ゴン
    -2.26
    Positive Logits
     idiots
    3.51
     devs
    3.42
     admins
    3.29
     doesnt
    3.22
     blackmail
    3.19
     downgrade
    3.07
     crap
    3.06
     incompetence
    3.03
     incentives
    2.92
     incentiv
    2.90
    Act Density 1.351%

    No Known Activations