Explanations
terms related to domains in web or internet contexts
Neuron Alignment
Index   Value   % of L₁
376     +0.17   1.0%
117     +0.11   0.7%
91      +0.11   0.6%
Correlated Neurons
Index   P. Corr.   Cos Sim.
117     +0.17      0.01
91      +0.11      0.01
126     +0.11      0.01
Negative Logits
queer         -1.71
tear          -1.69
transgender   -1.66
maternal      -1.65
promise       -1.59
birth         -1.56
miscar        -1.55
ights         -1.54
↵             -1.54
inher         -1.53
Positive Logits
£         2.73
ļ         2.60
ĸ´        2.60
ĸ         2.58
↵         2.58
↵↵        2.58
↵         2.58
(blank)   2.58
↵         2.58
(blank)   2.58
Activations Density: 0.141%