INDEX
Explanations
phrases related to social media hashtags
Neuron Alignment
Index   Value   % of L₁
298     +0.13   0.5%
1137    +0.12   0.4%
605     +0.12   0.4%
Correlated Neurons
Index   P. Corr.   Cos Sim.
298     +0.13      0.04
1137    +0.12      0.03
755     +0.12      0.03
Negative Logits
olivia     -0.57
batman     -0.49
austin     -0.46
joey       -0.45
slv        -0.42
superman   -0.42
harley     -0.42
madison    -0.42
over       -0.42
层面       -0.41
Positive Logits
.#        0.90
$\#       0.83
\#        0.83
#         0.81
solidar   0.80
\#        0.78
"#        0.75
/#        0.74
utop      0.74
'#        0.74
Activations Density: 0.066%