INDEX
    Explanations

    negation, incorrect, strong sentiment

    Negative Logits
    "ریت"  0.58
    "标注"  0.48
    (token not shown)  0.48
    "ری"  0.47
    "ρι"  0.46
    (token not shown)  0.46
    (token not shown)  0.46
    "회를"  0.45
    "キーワード"  0.45
    "ată"  0.44

    Positive Logits
    " offent"  0.50
    "lowski"  0.48
    "fear"  0.46
    " unn"  0.46
    "faux"  0.45
    "fuck"  0.45
    "Fuck"  0.44
    " hysteria"  0.44
    " मनु"  0.43
    " ఎలా"  0.43

    Activation Density: 0.003%

    No Known Activations