Explanations
words associated with risks, consequences, and the importance of safety in various contexts
Negative Logits
Token       Logit
selves      -0.64
ovember     -0.60
enegger     -0.59
olulu       -0.57
+.          -0.56
poon        -0.55
iolet       -0.54
ornings     -0.54
Ago         -0.54
ECA         -0.54
Positive Logits
Token       Logit
iest        0.81
varies      0.80
becomes     0.75
consists    0.72
remains     0.71
is          0.71
goes        0.69
itself      0.69
reaches     0.67
disappears  0.67
Activations Density: 0.276%