INDEX
    Explanations

    references to harmful or dangerous entities, such as killer whales or people

    references to "killer whales."

    Negative Logits
    Token          Logit
    "ional"        -0.91
    "ational"      -0.83
    "ourced"       -0.83
    "bles"         -0.82
    "é¾"           -0.80
    "yrinth"       -0.80
    "ourcing"      -0.78
    "edu"          -0.78
    "OVER"         -0.77
    "atile"        -0.76

    Positive Logits
    Token          Logit
    " killer"      0.93
    "killer"       0.92
    " killers"     0.84
    " whales"      0.82
    " Killer"      0.79
    " spree"       0.77
    " instinct"    0.75
    " whale"       0.73
    "knife"        0.71
    " Orange"      0.71

    Act Density 0.028%

    No Known Activations