INDEX
    Explanations

    phrases related to advice or warnings

    phrases indicating potential hazards or dangers

    Negative Logits
    Token              Logit
    " UNCLASSIFIED"    -0.81
    " thereafter"      -0.76
    " afterwards"      -0.75
    " continuity"      -0.74
    " stated"          -0.74
    "emort"            -0.72
    " substantive"     -0.72
    " objectives"      -0.68
    " secondly"        -0.67
    " afterward"       -0.66
    Positive Logits
    Token              Logit
    "Researchers"      0.90
    "Scientists"       0.83
    "toggle"           0.82
    "utterstock"       0.79
    " Researchers"     0.78
    " Guinness"        0.77
    "hello"            0.76
    " Nielsen"         0.76
    " Shutterstock"    0.74
    "Redditor"         0.74
    Activation Density: 0.855%

    No Known Activations