INDEX
Explanations
references to news outlets and media sources
Negative Logits
veter     -0.67
animate   -0.64
diaper    -0.59
harms     -0.56
causal    -0.55
atible    -0.54
surgical  -0.54
parity    -0.54
pires     -0.54
doesnt    -0.52
Positive Logits
.        0.84
quoted   0.77
.</      0.76
rhet     0.74
quoting  0.73
.]       0.70
.).      0.69
sarcast  0.69
."       0.67
lied     0.67
Activations Density 0.145%