INDEX
Explanations
words or phrases related to entertainment and sensationalism
Negative Logits
indr      -0.15
DRAM      -0.14
reation   -0.14
abus      -0.14
DESC      -0.14
daytime   -0.14
Hoch      -0.13
ænd       -0.13
Shapiro   -0.13
avou      -0.13
Positive Logits
UME        0.17
tü         0.16
orta       0.15
packed     0.14
j          0.14
enity      0.14
yn         0.14
yc         0.14
.jackson   0.14
yna        0.14
Activations Density: 0.248%