Explanations
references to pornographic material and related content
Neuron Alignment
Index   Value   % of L₁
1174    +0.13   0.4%
1961    +0.10   0.4%
1363    +0.10   0.4%
Correlated Neurons
Index   P. Corr.   Cos Sim.
1174    +0.13      0.02
944     +0.10      0.02
625     +0.10      0.02
Negative Logits
Token        Logit
Sklici       -0.61
醐           -0.57
solidar      -0.55
Reparto      -0.54
Referencie   -0.52
Kör          -0.52
Kö           -0.49
Zunanje      -0.49
Merk         -0.48
kapital      -0.48
Positive Logits
Token         Logit
porn          1.23
pornography   1.11
Porn          1.10
Porn          0.95
porn          0.94
pamph         0.71
unspeak       0.65
surpl         0.63
subgoals      0.62
bourg         0.62
Activations Density 0.071%