INDEX
Explanations
mentions of the specific animal "duck"
references to ducks
Negative Logits
Palestin       -0.83
occas          -0.76
agre           -0.75
ccording       -0.75
Interstitial   -0.72
accur          -0.70
conflic        -0.69
srf            -0.66
conclud        -0.66
exercised      -0.65
Positive Logits
lings           1.22
weed            1.08
tails           0.99
ducks           0.98
duck            0.97
fish            0.91
bowl            0.90
aroo            0.89
Duck            0.89
lift            0.88
Activations Density: 0.011%