INDEX
    Explanations

    references to sexual exploitation and trafficking

    Negative Logits
    Token             Logit
    "bjerg"           -0.17
    "killer"          -0.16
    "ettle"           -0.16
    "noop"            -0.14
    "nerg"            -0.14
    "ULSE"            -0.14
    " assassin"       -0.14
    " setattr"        -0.14
    "diet"            -0.13
    " Leakage"        -0.13
    Positive Logits
    Token             Logit
    " trafficking"    0.41
    " Traff"          0.39
    " traff"          0.39
    " sex"            0.35
    "Tra"             0.31
    " human"          0.31
    "-tra"            0.29
    " traf"           0.28
    " exploitation"   0.28
    " forced"         0.28
    Act Density 0.025%

    No Known Activations