Explanations
references to specific topics or themes within the text
Negative Logits
outs      -0.18
ora       -0.18
ude       -0.17
ers       -0.16
itan      -0.16
out       -0.15
ering     -0.15
iggers    -0.15
poz       -0.15
ibt       -0.15
Positive Logits
starter    0.25
材         0.19
covered    0.19
areas      0.18
areas      0.17
iversary   0.16
starter    0.16
steller    0.16
matter     0.16
đích       0.16
Activations Density 0.014%