INDEX
Explanations
expressions of gratitude and importance
Neuron Alignment
Index    Value    % of L₁
674      +0.12    0.3%
1131     +0.07    0.2%
735      +0.07    0.2%
Correlated Neurons
Index    P. Corr.    Cos Sim.
1774     +0.12       0.02
791      +0.07       0.03
1801     +0.07       0.02
Negative Logits
fte      -1.02
?...     -0.99
secon    -0.95
wald     -0.94
idr      -0.94
wien     -0.94
fta      -0.94
§.       -0.94
fff      -0.94
oun      -0.93
Positive Logits
difference     0.63
venser         0.54
difference     0.50
setToolTip     0.49
twimg          0.47
차             0.46
helps          0.45
diferença      0.45
lot            0.45
createSlice    0.45
Activations Density 0.184%