    Explanations

    violence and hate speech

    Negative Logits
    =|                0.41
     recordings       0.39
    聞いた            0.38
     committee        0.38
    attie             0.37
     allocate         0.37
     Correspondence   0.37
    aha               0.36
     electrical       0.36
     meals            0.36

    Positive Logits
     demost           0.47
    Nuestro           0.44
    インド            0.43
     dimost           0.43
    Чи                0.43
     demostrar        0.42
     interrom         0.42
    イチ              0.42
                      0.42
    rendre            0.42
    Activation Density: 13.922%

    No Known Activations