    Explanations

    making less desirable or weakening

    Negative Logits
    Token                 Value
    " understandably"     0.54
    "entions"             0.42
    " aufgrund"           0.41
    " Laurent"            0.40
    " vanwege"            0.40
    " quest"              0.39
    (token missing)       0.39
    " explic"             0.38
    " несмотря"           0.38
    " explains"           0.38
    Positive Logits
    Token                 Value
    " destabil"           0.95
    " disrupting"         0.87
    " disrupt"            0.86
    " demoral"            0.85
    " discourage"         0.83
    "故意"                0.81
    " dissu"              0.79
    " disrupts"           0.78
    " disruption"         0.78
    " discour"            0.77
    Act Density 0.048%
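
    For readers unfamiliar with the density figure, here is a minimal sketch of how such a percentage could be computed, assuming it is the fraction of sampled tokens on which the feature's activation exceeds zero. The function name, the threshold, and the sample data are all hypothetical; the sample is made up so that the arithmetic lands on the same percentage, and it is not the data behind this feature.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of tokens whose activation exceeds `threshold`.

    Assumption: "density" means (active tokens) / (total tokens sampled).
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: 12 active tokens out of 25,000 sampled tokens.
sample = [0.0] * 24_988 + [1.5] * 12
density = activation_density(sample)
print(f"Act Density {density:.3%}")
```

    Under this reading, a density this low means the feature fires on roughly 1 token in 2,000.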

    No Known Activations