INDEX
    Explanations

    verbs related to causing harm

    references to the concept of "ruin" and its derivatives

    Negative Logits
    Token         Logit
    "appa"        -0.74
    "duino"       -0.68
    "heter"       -0.67
    "enne"        -0.67
    "bors"        -0.67
    "leground"    -0.67
    "arij"        -0.66
    "gencies"     -0.66
    "taboola"     -0.63
    "WER"         -0.63
    Positive Logits
    Token         Logit
    " havoc"      1.07
    "ous"         0.96
    "ously"       0.94
    "OUS"         0.81
    "fully"       0.81
    "stal"        0.81
    " spoil"      0.79
    " spo"        0.77
    "strument"    0.76
    "ifully"      0.76
    Act Density 0.018%

    No Known Activations