INDEX
Explanations
mentions of U.S. states and their relationship to specific policies or events
Head Attr Weights
0:0.18
1:0.01
2:0.11
3:0.03
4:0.05
5:0.03
6:0.06
7:0.02
8:0.07
9:0.07
10:0.14
11:0.17
Negative Logits
ォ           -1.19
20439       -1.17
subtract    -1.17
Reviewer    -1.16
separ       -1.14
conven      -1.10
icides      -1.08
leve        -1.06
Redditor    -1.06
derogatory  -1.02
Positive Logits
ember         1.34
Mars          1.32
phia          1.29
chell         1.25
respectively  1.22
Web           1.21
etc           1.17
runners       1.16
Share         1.16
Amazon        1.16
Activations Density 0.020%