INDEX
Explanations
email addresses and domain-related patterns
Head Attr Weights
0:0.08
1:0.03
2:0.05
3:0.13
4:0.02
5:0.09
6:0.05
7:0.13
8:0.06
9:0.04
10:0.17
11:0.10
Negative Logits
DRAG: -1.07
®: -1.01
theless: -0.93
�: -0.92
moreover: -0.91
DRAGON: -0.86
CONTROL: -0.85
-0.84
CARD: -0.82
ographically: -0.81
Positive Logits
Twe: 0.91
__: 0.90
Politics: 0.90
___: 0.87
Story: 0.86
omics: 0.86
DonaldTrump: 0.85
itbart: 0.85
Kid: 0.83
haw: 0.82
Activations Density: 0.040%