INDEX
Explanations
references to physical walls
Negative Logits
Token       Logit
Gene        -0.79
amate       -0.68
CENT        -0.67
ptive       -0.67
lihood      -0.66
forward     -0.66
milo        -0.65
munition    -0.64
phrine      -0.62
×ķ          -0.62
Positive Logits
Token       Logit
papers      1.17
abies       1.14
aby         1.08
wall        0.92
clock       0.89
walls       0.89
top         0.85
wart        0.84
tops        0.82
thickness   0.82
Activations Density: 0.012%