INDEX
Explanations
phrases related to planning, agency, and social action
Negative Logits
atron       -0.17
енка        -0.15
thro        -0.15
ndx         -0.14
acin        -0.14
äch         -0.14
libertin    -0.14
etzt        -0.14
873         -0.14
awn         -0.14

Positive Logits
support     0.16
save        0.15
material    0.15
prolong     0.14
live        0.14
lit         0.14
jump        0.14
the         0.14
Jump        0.14
Kob         0.13
Activations Density 0.262%
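
One plausible reading of a density figure like the one above is the fraction of token positions on which the feature's activation is nonzero. The source does not say how the number is computed, so the sketch below is an assumption: the function name `activation_density`, the `threshold` parameter, and the sample data are all hypothetical and are not taken from the dashboard.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds threshold.

    Assumption: "density" means the share of active positions; the dashboard
    itself does not state its definition.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical data: 1,000 token positions, 3 of which are active.
sample = [0.0] * 997 + [0.8, 1.2, 0.4]
density = activation_density(sample)
print(f"Activation density: {density:.3%}")
```

Under this reading, a density of 0.262% would mean the feature fires on roughly 26 of every 10,000 tokens.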