INDEX
Explanations
references to academic positions and affiliations
Head Attr Weights
0:0.02
1:0.01
2:0.08
3:0.05
4:0.08
5:0.02
6:0.04
7:0.24
8:0.03
9:0.04
10:0.22
11:0.12
Negative Logits
playlist    -1.47
slang       -1.32
Zombies     -1.32
Arcade      -1.32
listeners   -1.30
ッ          -1.29
黒          -1.28
オ          -1.28
Tube        -1.27
ーン        -1.26
Positive Logits
Theodore    1.43
reconstruct 1.34
inally      1.32
rejuven     1.28
ijah        1.28
reversible  1.27
edi         1.25
uctor       1.25
uncover     1.24
rex         1.24
Activations Density: 0.003%