INDEX
    Explanations

    warnings or negative implications related to actions that can lead to significant negative consequences

    instances of the word "ruin" and its variations indicating negative consequences

    Negative Logits
    "arij"        -0.81
    "leground"    -0.80
    "duino"       -0.80
    "bors"        -0.76
    " reluct"     -0.71
    "appa"        -0.70
    "soType"      -0.68
    "rict"        -0.68
    "rouch"       -0.66
    ">>>>>>>>"    -0.65
    Positive Logits
    " havoc"      1.15
    "ous"         0.90
    "ously"       0.86
    "OUS"         0.81
    " spoil"      0.79
    " spo"        0.78
    " ruined"     0.76
    "ifully"      0.76
    " ruining"    0.75
    " ruin"       0.74
    Activation Density: 0.039%

    No Known Activations