INDEX
    Explanations

    words related to restriction, censorship, or prohibition

    terms related to blocking or censorship

    Negative Logits
    "ller"      -0.73
    "lli"       -0.73
    "brates"    -0.72
    "brate"     -0.72
    "llers"     -0.70
    "gow"       -0.70
    "rious"     -0.69
    "ivil"      -0.66
    "ria"       -0.66
    "EMBER"     -0.66

    Positive Logits
    " blocking"   0.95
    "buster"      0.90
    "busters"     0.88
    "listed"      0.80
    " blockers"   0.80
    "chains"      0.78
    "quote"       0.78
    "aded"        0.78
    "lights"      0.78
    "ades"        0.77

    Activation density: 0.019%

    No Known Activations