INDEX
Explanations
statements expressing a point of view or making a claim
Neuron Alignment
Index   Value   % of L₁
1129    +0.08   0.2%
1899    +0.08   0.2%
100     +0.07   0.2%
Correlated Neurons
Index   P. Corr.   Cos Sim.
100     +0.08      0.05
239     +0.08      0.04
247     +0.07      0.04
Negative Logits
Token     Logit
boks      -0.60
erk       -0.59
traktor   -0.59
Rgds      -0.59
koc       -0.56
stik      -0.55
anse      -0.54
reger     -0.54
skr       -0.54
spion     -0.53
Positive Logits
Token     Logit
.-"       0.64
shayari   0.61
🤣🤣      0.60
😭😭      0.59
milf      0.57
CARTOON   0.56
ciebie    0.56
soulign   0.55
faggot    0.55
theirs    0.55
Activations Density: 0.298%