INDEX
Explanations
Controversial social/political discussions
The neuron activates on Spanish text segments (Spanish words and punctuation).
finds tokens that indicate abusive or harmful content, especially hate speech, racist/discriminatory claims, or calls for violence/self-harm.
content involving hate speech or slurs and sensitive discussions about race and discrimination, often alongside other safety-flag topics like violence or self-harm.
New Auto-Interp
Negative Logits
rnd
-0.07
ивать
-0.07
raud
-0.07
factor
-0.07
separ
-0.06
technically
-0.06
λογή
-0.06
ifica
-0.06
-0.06
impeachment
-0.06
POSITIVE LOGITS
-submit
0.07
REDIT
0.06
ař
0.06
朋
0.06
descri
0.06
Country
0.06
Az
0.06
Slot
0.06
slots
0.06
commem
0.06
Activations Density 0.103%