INDEX
    Explanations

    phrases that express negation or contradiction

    Head Attribution Weights
    0:0.01
    1:0.01
    2:0.08
    3:0.34
    4:0.02
    5:0.02
    6:0.06
    7:0.11
    8:0.04
    9:0.06
    10:0.05
    11:0.15
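
The weights above can be ranked to see which attention heads dominate. The following is a minimal sketch, assuming each entry is a `head index: weight` pair; the names `head_weights` and `rank_heads` are hypothetical and not part of the dashboard. As listed, the weights sum to 0.95 rather than 1.

```python
# Head attribution weights copied from the listing above (head index -> weight).
head_weights = {
    0: 0.01, 1: 0.01, 2: 0.08, 3: 0.34, 4: 0.02, 5: 0.02,
    6: 0.06, 7: 0.11, 8: 0.04, 9: 0.06, 10: 0.05, 11: 0.15,
}


def rank_heads(weights, top_k=3):
    """Return the top_k (head, weight) pairs, largest weight first."""
    return sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


total = sum(head_weights.values())
top = rank_heads(head_weights)
# Share of the listed total carried by the top-ranked heads.
top_share = sum(w for _, w in top) / total

for head, weight in top:
    print(f"head {head}: {weight:.2f}")
print(f"top-{len(top)} share of listed total: {top_share:.1%}")
```

Under this reading, head 3 carries the largest weight, followed by heads 11 and 7.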
    Negative Logits
    Token          Logit
    " Pengu"       -1.30
    "SY"           -1.27
    " Boat"        -1.18
    "EStream"      -1.16
    "Gi"           -1.16
    "�士"          -1.13
    "��"           -1.13
    " Shades"      -1.12
    " Ammo"        -1.12
    " Lifetime"    -1.10
    Positive Logits
    Token          Logit
    "than"         1.64
    " bends"       1.54
    "iffe"         1.49
    "forth"        1.29
    "reaching"     1.28
    "iating"       1.27
    "mentioned"    1.26
    "ences"        1.25
    "aunts"        1.24
    "iations"      1.23
    Activation Density: 0.017%
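
To put the density figure in concrete terms, here is a small sketch that converts the percentage into expected counts. It assumes the density is the fraction of tokens on which the feature is active, which the listing does not state explicitly; `expected_active_tokens` is a hypothetical helper name.

```python
# Density as listed above, in percent.
density_percent = 0.017

# Convert percent to a plain fraction (assumption: fraction of tokens
# on which the feature has a nonzero activation).
density_fraction = density_percent / 100.0


def expected_active_tokens(n_tokens, fraction=density_fraction):
    """Expected number of tokens on which the feature fires."""
    return n_tokens * fraction


# Average number of tokens per activation under the same assumption.
tokens_per_activation = 1.0 / density_fraction

print(round(expected_active_tokens(1_000_000)))
print(round(tokens_per_activation))
```

Under that assumption the feature would fire on roughly 170 of every million tokens, which is consistent with a very sparse feature.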

    No Known Activations