Explanations
discussions about societal issues and moral dilemmas
harmful topics
The neuron detects racist or strongly derogatory language aimed at social groups (demeaning or offensive statements).
Negative Logits
ကိုးကား    -0.44
bune    -0.39
+:+    -0.36
赦    -0.34
ภาค    -0.34
heureux    -0.34
excelencia    -0.34
onViewCreated    -0.33
orsche    -0.32
eterangan    -0.31
Positive Logits
NSCoder    0.60
queryInterface    0.52
prostitution    0.50
ThroughAttribute    0.49
illegal    0.49
onPostExecute    0.49
DebuggerNonUser    0.49
harmful    0.48
behaviors    0.47
Bibliograf    0.47
Activations Density 0.096%