Explanations
references to popular snacks and comfort foods
Negative Logits
Soup       -0.17
Soup       -0.16
MV         -0.15
soup       -0.15
itel       -0.15
Slash      -0.15
Blade      -0.15
_lambda    -0.15
ngine      -0.15
soup       -0.15
Positive Logits
brittle    0.21
bars       0.17
reward     0.17
snack      0.16
Reward     0.15
bars       0.15
Geh        0.15
rewards    0.14
nyder      0.14
Nab        0.14
Activations Density: 0.111%