Explanations
Proper names, specifically those related to politics or news events
References to Washington, D.C.
Negative Logits
order       -0.73
unct        -0.68
ordering    -0.66
fund        -0.66
gger        -0.65
odic        -0.65
other       -0.64
eworld      -0.64
hist        -0.64
lass        -0.63
Positive Logits
ASHINGTON   1.29
WASHINGTON  1.18
aukee       1.11
ashtra      0.91
Washington  0.88
DC          0.87
sburgh      0.85
ADA         0.85
STATE       0.85
ukong       0.85
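The logit lists above rank vocabulary tokens by how strongly this feature pushes their output logits up or down. A minimal sketch of how such lists are commonly produced, assuming (as in typical sparse-autoencoder dashboards, not confirmed by this page) that each score is the dot product of the feature's decoder direction with the token's unembedding vector. The functions logit_scores and top_logits are hypothetical, and the 2-dimensional vectors are toy values, not this feature's real weights.

```python
from typing import Dict, List, Tuple

# Assumption (hypothetical): a token's logit score is the dot product of the
# feature's decoder direction with that token's unembedding vector.
def logit_scores(decoder_dir: List[float],
                 unembed: Dict[str, List[float]]) -> Dict[str, float]:
    return {tok: sum(d * u for d, u in zip(decoder_dir, vec))
            for tok, vec in unembed.items()}

def top_logits(scores: Dict[str, float], k: int
               ) -> Tuple[List[Tuple[str, float]], List[Tuple[str, float]]]:
    """Return (top-k positive, top-k negative) token/score pairs."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1])
    negative = ranked[:k]          # most negative score first
    positive = ranked[::-1][:k]    # most positive score first
    return positive, negative

# Toy 2-dimensional vectors (hypothetical, not this feature's real weights).
decoder_dir = [1.0, 0.0]
unembed = {"Washington": [0.9, 0.1], "DC": [0.8, -0.2],
           "order": [-0.7, 0.3], "other": [-0.6, 0.5]}
positive, negative = top_logits(logit_scores(decoder_dir, unembed), k=2)
print([t for t, _ in positive])  # ['Washington', 'DC']
print([t for t, _ in negative])  # ['order', 'other']
```

Sorting once and slicing from both ends yields both lists in a single pass over the vocabulary scores.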
Activation density: 0.006%
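Activation density is typically reported as the percentage of tokens on which the feature fires, so 0.006% corresponds to roughly 6 of every 100,000 tokens. A minimal sketch, assuming a token counts as firing when its activation exceeds a threshold of 0 (the dashboard's actual threshold is not stated); activation_density and the toy data are hypothetical.

```python
from typing import Sequence

def activation_density(activations: Sequence[float], threshold: float = 0.0) -> float:
    """Percentage of tokens whose activation exceeds `threshold` (assumed 0)."""
    if not activations:
        raise ValueError("need at least one token activation")
    firing = sum(1 for a in activations if a > threshold)
    return 100.0 * firing / len(activations)

# Hypothetical toy data: 6 firing tokens out of 100,000.
toy = [0.0] * 99_994 + [2.5] * 6
print(f"{activation_density(toy):.3f}%")  # prints 0.006%
```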