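    Negative Logits
    Token             Value
    " bragging"       0.64
    " cunning"        0.64
    "面白"            0.62   (Japanese: "funny, interesting")
    " hilarious"      0.61
    " exaggerate"     0.61
    "面白い"          0.60   (Japanese: "funny, interesting")
    " 재미"           0.59   (Korean: "fun")
    " quirks"         0.59
    " glamorous"      0.59
    " মজার"           0.57   (Bengali: "funny")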
    Positive Logits
    Token             Value
    " respectful"     0.68
    "Sensitivity"     0.64
    " Feminist"       0.60
    " Sensitivity"    0.59
    " sensitively"    0.59
    " educators"      0.59
    " feminist"       0.58
    " respectfully"   0.58
    " LGBTQ"          0.58
    " ধর্ষণ"          0.57   (Bengali: "rape")
    Activation Density: 0.005%

    No Known Activations