INDEX
    Explanations

    Negative Logits
    Token             Logit
    " lokale"         -0.08
    " שגם"            -0.08
    " dama"           -0.08
    " regionale"      -0.07
    " sulla"          -0.07
    " tout"           -0.07
    " régional"       -0.07
    " να"             -0.07
    " dame"           -0.07
    " soprattutto"    -0.07
    Positive Logits
    Token             Logit
    " blah"            0.10
    "??↵↵"             0.08
    "uppercase"        0.08
    " cleanliness"     0.08
    " pleased"         0.08
    "???↵↵"            0.08
    "Pizza"            0.08
    " constructs"      0.07
    " ??↵↵"            0.07
                       0.07
    Activation Density: 0.015%

    No Known Activations