Explanations
references and citations within a text
brackets and their contents in the text
Negative Logits
estab      -0.83
Engineers  -0.82
pudding    -0.76
ende       -0.71
plateau    -0.70
Franch     -0.68
Elon       -0.68
Horses     -0.68
upt        -0.67
Genie      -0.67
Positive Logits
note       1.69
Pg         1.49
reviewed   1.34
4          1.33
src        1.32
8          1.32
7          1.32
5          1.31
6          1.30
...]       1.30
Activations Density: 0.022%