    Explanations

    Attends from tokens that are part of usernames or channel names to tokens that denote social media or platform links.

    Head Attr Weights
    0:0.07
    1:0.08
    2:0.15
    3:0.13
    4:0.06
    5:0.03
    6:0.22
    7:0.22
    Negative Logits
     autorytatywna
    -0.41
     kasarigan
    -0.34
     defaultstate
    -0.33
     ویکی‌پدی
    -0.33
     mergeFrom
    -0.31
    :✨
    -0.30
    }');
    -0.30
    SerializedSize
    -0.30
    jspb
    -0.29
    MigrationBuilder
    -0.29
    Positive Logits
     Slee
    0.29
     ISTAT
    0.28
     catalyzed
    0.27
     Turch
    0.27
    bech
    0.26
     Luce
    0.26
     chao
    0.25
     emancipation
    0.25
     Eman
    0.25
    ilat
    0.25
    Act Density 0.039%

    No Known Activations