INDEX
Explanations
references to time indicators and specific locations in the text
Negative Logits
–and
-0.49
–
-0.45
–↵↵
-0.44
.–
-0.40
––
-0.35
──
-0.31
————
-0.26
‐‐
-0.24
)—
-0.24
,—
-0.24
Positive Logits
-
0.99
-↵
0.69
-↵↵
0.55
-,
0.49
-.
0.46
-(
0.43
-*
0.40
-:
0.37
-$
0.36
-=
0.33
Activations Density 0.336%