Explanations
This neuron detects age‐rating or maturity indicators (e.g. “18+,” “mature audiences”) in content warnings.
Negative Logits
alumno          -0.07
meziná          -0.07
prostituerte    -0.06
LineNumber      -0.06
datingside      -0.06
tesis           -0.06
atual           -0.06
frase           -0.06
mice            -0.06
hiệu            -0.06
Positive Logits
س               0.08
-п              0.06
Amendment       0.06
zman            0.06
_lin            0.06
.channel        0.06
ं               0.06
особ            0.06
關              0.06
715             0.06
Activations Density: 0.001%