INDEX
Explanations
references to citations or sources within text
references to citations
Negative Logits
Token          Logit
inav           -0.88
pora           -0.74
merce          -0.71
ratulations    -0.69
milo           -0.68
Jet            -0.65
FTWARE         -0.65
sburg          -0.61
steen          -0.61
wear           -0.60
Positive Logits
Token          Logit
omitted        1.00
]"             0.92
=]             0.91
needed         0.91
][             0.89
]              0.84
])             0.83
?]             0.83
]).            0.81
],[            0.79
Activations Density: 0.020%