INDEX
Explanations
promoting
The neuron fires on key words and phrases that name or introduce unsafe-content categories (e.g. “sexual…arouse,” “promotes,” “depicts,” “incites,” “self-harm,” “violence”), effectively marking the tokens that specify the type of policy violation.
Negative Logits
Token         Logit
άνα           -0.07
ackages       -0.07
ITS           -0.06
items         -0.06
.buffer       -0.06
ERT           -0.06
Server        -0.06
dimensions    -0.06
monitor       -0.06
isateur       -0.06
Positive Logits
Token         Logit
,一            0.08
ebi           0.07
rtype         0.07
starttime     0.07
dığını        0.06
Tub           0.06
eBook         0.06
adolu         0.06
Hoover        0.06
quarterbacks  0.06
Activations Density 0.016%
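
The density figure above can be read as the share of tokens on which the neuron is active. The sketch below illustrates that reading only. It assumes that "active" means an activation strictly above a threshold of 0, which the page does not state. The function name `activation_density` and the sample data are hypothetical and do not come from the tool that produced this page.

```python
def activation_density(activations, threshold=0.0):
    """Return the percentage of tokens whose activation exceeds `threshold`.

    Assumption: a token counts as active when its activation is strictly
    greater than `threshold`; the source page does not define this.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return 100.0 * active / len(activations)


# Hypothetical sample: 2 active tokens out of 12,500 (not real data).
sample = [0.0] * 12_498 + [0.7, 1.3]
print(f"{activation_density(sample):.3f}%")
```

Under this reading, a density of 0.016% corresponds to roughly 16 active tokens per 100,000, which would fit a neuron that responds only to a narrow set of policy-category words.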