INDEX
Explanations
specific references or citations within a text
references to formal citations or documentation
Negative Logits
nut         -0.84
Tycoon      -0.80
hm          -0.73
independ    -0.72
hma         -0.71
enthal      -0.70
anyahu      -0.70
pher        -0.70
kish        -0.70
nu          -0.68
Positive Logits
citation     1.60
citations    1.29
Citation     1.08
cited        0.82
footnote     0.82
ibli         0.81
Footnote     0.81
cite         0.77
Forbidden    0.75
ENC          0.74
Activations Density: 0.015%