INDEX
Explanations
references to food items, specifically burritos
Negative Logits
Token    Logit
ober     -0.18
succ     -0.16
elly     -0.16
arem     -0.16
otre     -0.16
aÅĻ      -0.16
esp      -0.15
tiv      -0.15
elle     -0.15
epar     -0.14
Positive Logits
Token    Logit
rough    0.35
rows     0.33
rowing   0.33
rito     0.32
row      0.29
ritos    0.29
ied      0.29
leigh    0.28
undi     0.28
dock     0.28
Activations Density: 0.009%