INDEX
Explanations
references to celebrities
mentions of celebrities
Negative Logits
nerg          -0.70
THER          -0.64
gger          -0.61
Agg           -0.60
abus          -0.59
ña            -0.59
Wonderland    -0.59
condition     -0.58
plet          -0.58
yg            -0.58
Positive Logits
rities        1.38
celebrities   1.03
hips          0.84
ervative      0.82
endors        0.80
ervatives     0.80
cele          0.78
Celeb         0.76
ãħĭ           0.75
Cosponsors    0.73
Activations Density: 0.013%