INDEX
Explanations
references to doors and their associated actions or qualities
Negative Logits
vrier      -0.17
tl         -0.17
adele      -0.16
tu         -0.16
tank       -0.16
dic        -0.15
tlement    -0.15
ertools    -0.15
OfWork     -0.15
edd        -0.15
Positive Logits
ways       0.45
bell       0.40
frame      0.32
keeper     0.29
steps      0.28
/window    0.26
frames     0.26
WAYS       0.26
step       0.25
knob       0.25
Activations Density: 0.042%