INDEX
Explanations
names of famous pop culture figures or groups
references to specific musical artists and cultural entities
Negative Logits
Token        Logit
itals        -0.86
lished       -0.84
lessly       -0.83
ital         -0.81
merce        -0.81
itism        -0.79
haps         -0.78
orously      -0.77
ificial      -0.77
istically    -0.76
Positive Logits
Token        Logit
Wings        1.06
Hearts       1.05
Ducks        0.98
Feet         0.96
Ones         0.95
Bears        0.92
Birds        0.86
Hands        0.86
Bones        0.86
Duck         0.86
Activations Density: 0.148%