INDEX
Explanations
censored profanity
This neuron detects profanity or expletive fragments (e.g., the symbols used to censor swear words).
Negative Logits
Token         Logit
abant         -0.06
("            -0.06
783           -0.06
Noise         -0.06
baths         -0.06
.untracked    -0.06
ces           -0.06
JK            -0.06
Lincoln       -0.06
ÜNİVERS       -0.06
Positive Logits
Token         Logit
picked        0.07
$time         0.06
tháng         0.06
Changed       0.06
.ViewModels   0.06
ومات          0.06
Wax           0.06
прав          0.06
_sd           0.06
-degree       0.06
Activations Density 0.005%
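
As a rough illustration of how a density figure like the one above could be read, the sketch below assumes that density means the percentage of sampled tokens on which the neuron's activation is above zero. That definition, the function name `activation_density`, and the sample data are all assumptions made for illustration; they are not taken from the tool that produced this page.

```python
def activation_density(activations, threshold=0.0):
    """Return the percentage of tokens whose activation exceeds `threshold`.

    Assumption: "density" is the share of sampled tokens on which the
    neuron fires at all, expressed as a percentage.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return 100.0 * active / len(activations)


# Hypothetical data: one active token out of 20,000 sampled tokens.
sample = [0.0] * 19_999 + [3.2]
density = activation_density(sample)
print(f"{density:.3f}%")
```

Under this reading, a density of 0.005% would correspond to the neuron firing on about 1 in every 20,000 tokens, which fits a detector for a rare pattern such as censored profanity.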