INDEX
Explanations
Instructions, opinions
The neuron flags instructional language that guides users to carry out unsafe, unethical, or otherwise disallowed actions.
Any mention of dangerous or harmful activities, and issues surrounding consent and ethics.
Negative Logits
.URL        -0.07
top         -0.07
Frequency   -0.07
synergy     -0.06
Houses      -0.06
Sections    -0.06
benchmark   -0.06
Orders      -0.06
fish        -0.06
Velocity    -0.06
Positive Logits
dern        0.08
amız        0.07
']?>"       0.06
اما         0.06
trough      0.06
극          0.06
siyas       0.06
aktif       0.06
imiz        0.06
ขนาด        0.06
Activations Density: 0.041%
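
The density figure is not defined on this page. A common reading, assumed here, is the fraction of token positions on which the feature activates at all. The sketch below illustrates that reading only; the function name `activation_density`, the `threshold` parameter, and the toy data are hypothetical and do not come from the dashboard.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of positions whose activation exceeds `threshold`.

    Assumption: "density" means the share of token positions where the
    feature fires. The dashboard itself does not state its definition.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical toy data: 10,000 token positions, 4 of which are active.
toy = [0.0] * 9996 + [0.5, 1.2, 0.3, 2.0]
density = activation_density(toy)
print(f"{density:.3%}")
```

Under this reading, a density of 0.041% would mean the feature is active on roughly 4 of every 10,000 token positions, which suggests a sparse feature.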