Explanations
punctuation marks following clauses
commas and phrases indicating continuations or elaborations on previously stated ideas
Negative Logits
ode        -0.61
iasm       -0.56
eers       -0.56
CHA        -0.55
Score      -0.54
Reward     -0.53
igers      -0.53
sth        -0.52
ODE        -0.52
è¦ļéĨĴ     -0.51
Positive Logits
unlike        1.20
contrary      1.09
despite       1.09
although      1.06
despite       1.05
irrespective  0.98
regardless    0.96
barring       0.96
whereas       0.95
insofar       0.94
Activations Density: 0.095%