INDEX
    Explanations

    mentions of people or organizations, potentially in a negative context

    the symbol or character representation of certain expressions or emphasis

    Negative Logits
    Token             Logit
    " imitation"      -0.89
    " Seym"           -0.72
    " indo"           -0.69
    "anium"           -0.69
    " mathemat"       -0.68
    " accompan"       -0.66
    " constitu"       -0.66
    " fortun"         -0.66
    "arios"           -0.65
    " disadvant"      -0.64
    Positive Logits
    Token             Logit
    "ï¸ı"             1.23
    "ï¸"              0.94
    "VER"             0.85
    "女"              0.85
    "Balt"            0.82
    "STEM"            0.80
    "sure"            0.76
    "legal"           0.75
    "own"             0.74
    "£"               0.73
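
    The two top positive tokens look like encoding debris, but they may be raw byte-level tokens. The sketch below shows one plausible reading, assuming a GPT-2 style byte-level BPE vocabulary (the source does not say which tokenizer is used). The helper names token_to_bytes and describe are hypothetical, introduced only for this illustration. Under that assumption, the first token maps to the bytes of the variation selector U+FE0F, which would fit the second explanation above.

```python
def bytes_to_unicode():
    """Byte -> printable character table, as used by GPT-2 style byte-level BPE.

    Assumption: the dashboard's vocabulary follows this scheme.
    """
    # Bytes that are shown as themselves (printable Latin-1 ranges).
    printable = (
        list(range(0x21, 0x7F))
        + list(range(0xA1, 0xAD))
        + list(range(0xAE, 0x100))
    )
    mapping = {b: chr(b) for b in printable}
    # Every remaining byte is shifted to a code point at or above U+0100.
    shift = 0
    for b in range(256):
        if b not in mapping:
            mapping[b] = chr(256 + shift)
            shift += 1
    return mapping


CHAR_TO_BYTE = {ch: b for b, ch in bytes_to_unicode().items()}


def token_to_bytes(token: str) -> bytes:
    """Recover the raw bytes behind a displayed token string (hypothetical helper)."""
    return bytes(CHAR_TO_BYTE[ch] for ch in token)


def describe(token: str) -> str:
    """Return the code points of a token, or flag it as a partial UTF-8 sequence."""
    raw = token_to_bytes(token)
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return "incomplete UTF-8 sequence: " + raw.hex(" ")
    return " ".join(f"U+{ord(c):04X}" for c in text)


# "\u00ef\u00b8\u0131" is the first positive-logit token as displayed,
# "\u00ef\u00b8" is the second.
print(describe("\u00ef\u00b8\u0131"))
print(describe("\u00ef\u00b8"))
```

    If the assumption holds, the second token is the two-byte prefix of the same character, so it cannot be decoded on its own.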
    Act Density 0.509%

    No Known Activations