INDEX
Explanations
references to objects such as straws and other similar physical objects
references to "straw" and related terms
Negative Logits
ervation    -0.80
ccording    -0.77
olon        -0.77
ogue        -0.74
notor       -0.72
olitan      -0.71
itals       -0.70
ynt         -0.69
uria        -0.69
cial        -0.68
Positive Logits
straw       1.20
backs       0.93
pipe        0.88
weights     0.87
mere        0.86
weight      0.85
poll        0.84
Straw       0.80
bare        0.79
bridge      0.79
Activations Density 0.013%