INDEX
Explanations
references to engaging with the natural world and escaping civilization
Negative Logits
downs               -0.15
wa                  -0.14
Decide              -0.14
discrim             -0.14
ming                -0.14
Comple              -0.13
sanitize            -0.13
longleftrightarrow  -0.13
resett              -0.13
comm                -0.13
Positive Logits
Use                 0.20
don                 0.19
Use                 0.18
start               0.18
use                 0.17
eph                 0.17
don                 0.17
Start               0.16
make                0.16
Don                 0.16
Activations Density: 0.261%