INDEX
    Negative Logits
    Token                 Logit
    " characteristics"    -0.08
    " invit"              -0.08
    "farben"              -0.08
    " Kry"                -0.07
    " Erlebnis"           -0.07
    (blank)               -0.07
    "-indent"             -0.07
    " provoking"          -0.07
    " caos"               -0.07
    " Characteristics"    -0.07
    Positive Logits
    Token                 Logit
    "限制"                 0.13
    " ограничения"         0.12
    " restricciones"       0.11
    " beperk"              0.10
    " restrictions"        0.10
    " imposed"             0.10
    " censorship"          0.10
    "Restrictions"         0.10
    "Restr"                0.10
    " Restrictions"        0.10
    Activation Density: 0.008%

    No Known Activations