    Explanations

    hateful extremist ideologies

    Negative Logits
    " Physical"        0.51
    "Physical"         0.45
    " physical"        0.45
    " เหมาะ"           0.44
    " fizik"           0.43
    " Professional"    0.43
    " Phys"            0.41
    " trusty"          0.41
    " aventure"        0.41
    "商业"             0.40
    Positive Logits
    " extremist"       1.15
    " hateful"         1.08
    " propaganda"      1.06
    " fascist"         1.06
    " misog"           1.05
    " propagand"       1.05
    " extremism"       1.03
    " ideology"        1.02
    " sadistic"        0.98
    " racist"          0.97
    Activation Density: 0.048%

    No Known Activations