Explanations
references to the White House
Negative Logits
Token     Logit
uber      -0.16
nt        -0.15
nd        -0.15
AMES      -0.14
name      -0.14
ly        -0.14
da        -0.14
server    -0.14
_RA       -0.14
du        -0.14
Positive Logits
Token     Logit
House     0.31
house     0.27
House     0.25
house     0.25
hall      0.22
legg      0.22
hurst     0.21
aker      0.21
-collar   0.20
haven     0.20
Activations Density: 0.014%