Explanations
The neuron fires on words that frame or qualify disallowed requests as serving “educational,” “scientific,” “research,” or “professional” purposes.
Negative Logits
Indicates    -0.08
asto         -0.07
pys          -0.07
Www          -0.06
affles       -0.06
_office      -0.06
Soccer       -0.06
François     -0.06
Inches       -0.06
press        -0.06
Positive Logits
online       0.07
'',          0.07
будь         0.06
Startup      0.06
regular      0.06
_strategy    0.06
) ↵ ↵ ↵      0.06
titulo       0.06
_div         0.06
adı          0.06
Activations Density 0.020%
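
The dashboard does not say how the density figure is computed. As a minimal sketch, assuming it is the fraction of sampled token positions on which the neuron's activation is above zero, the calculation might look like the following. The function name `activation_density`, the `threshold` parameter, and the sample data are all hypothetical and chosen only for illustration.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of positions whose activation exceeds `threshold`.

    Assumption: "density" means the share of sampled tokens on which the
    neuron is active at all. The dashboard itself does not confirm this.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: the neuron is active on 2 of 10,000 token positions.
sample = [0.0] * 9998 + [1.3, 0.4]
density = activation_density(sample)
print(f"{density:.3%}")
```

Under this reading, a density of 0.020% would mean the neuron is active on roughly 2 of every 10,000 tokens, which fits a narrowly scoped feature.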