Explanations
phrases related to advice or warnings
phrases indicating potential hazards or dangers
Negative Logits
UNCLASSIFIED    -0.81
thereafter      -0.76
afterwards      -0.75
continuity      -0.74
stated          -0.74
emort           -0.72
substantive     -0.72
objectives      -0.68
secondly        -0.67
afterward       -0.66
Positive Logits
Researchers     0.90
Scientists      0.83
toggle          0.82
utterstock      0.79
Researchers     0.78
Guinness        0.77
hello           0.76
Nielsen         0.76
Shutterstock    0.74
Redditor        0.74
Activations Density: 0.855%