INDEX
    Explanations
    NEGATIVE LOGITS
    Token           Logit
    " Sensors"      -0.09
    " sensors"      -0.09
    " Sensor"       -0.08
    "Sensors"       -0.08
    " sensor"       -0.08
    "sensor"        -0.08
    " Io"           -0.07
    " वर्ष"           -0.07
    "_sensor"       -0.07
    "Sensor"        -0.07
    POSITIVE LOGITS
    Token           Logit
    " derog"        0.14
    " disrespect"   0.11
    " racist"       0.11
    " respectful"   0.10
    " hateful"      0.10
    " sexist"       0.10
    " العن"         0.10
    " profanity"    0.10
    " insulting"    0.09
    " lädt"         0.09
    Activation Density: 0.042%

    No Known Activations