INDEX
    Explanations

    references to fake or fraudulent concepts and entities

    Negative Logits
    "ils"       -0.16
    "shire"     -0.16
    "ally"      -0.16
    "ณ"         -0.15
    "erable"    -0.15
    "yle"       -0.14
    "орг"       -0.14
    " Naz"      -0.14
    "iw"        -0.14
    " Nz"       -0.14
    Positive Logits
    "/false"    0.24
    ".fake"     0.19
    "stin"      0.18
    "fak"       0.18
    "(fake"     0.17
    "busters"   0.17
    "pret"      0.16
    "/mock"     0.16
    "ulence"    0.15
    "eries"     0.15
    Activation Density 0.027%

    No Known Activations