INDEX
    Explanations

    phrases indicating excessive behavior or overreach

    Negative Logits
    Token           Logit
    "ussen"         -0.16
    "iola"          -0.15
    "stery"         -0.15
    "rella"         -0.15
    "dif"           -0.14
    "yh"            -0.14
    ".pref"         -0.14
    "orget"         -0.13
    "lectic"        -0.13
    "ierge"         -0.13
    Positive Logits
    Token           Logit
    " extreme"      0.60
    " extremes"     0.54
    " Extreme"      0.53
    "Extreme"       0.47
    " extrem"       0.38
    " excess"       0.35
    " excessive"    0.34
    " extremism"    0.34
    "極"            0.29
    " extremists"   0.28
    Activation Density: 0.231%

    No Known Activations