INDEX
    Explanations

    mentions of negative outcomes, specifically losses

    occurrences of the word "loss" in various contexts

    Negative Logits
    " Instruct"    -0.67
    " indo"        -0.66
    " Surve"       -0.66
    "omet"         -0.65
    "imaru"        -0.65
    "commun"       -0.64
    "Instruct"     -0.63
    "JB"           -0.63
    "ç«"           -0.62
    "ulhu"         -0.62
    Positive Logits
    " loss"        3.76
    " Loss"        2.94
    "loss"         2.91
    " losses"      2.62
    " losing"      1.74
    " lose"        1.60
    " lost"        1.55
    " defeat"      1.46
    " loses"       1.42
    " setback"     1.40
    Activation Density: 0.019%

    No Known Activations