INDEX
Explanations
phrases indicating proof and demonstration in academic contexts
Negative Logits
Token         Logit
ouden         -0.13
etik          -0.13
ivor          -0.13
agnost        -0.13
Recap         -0.13
elucid        -0.13
agar          -0.13
summarizes    -0.13
137           -0.12
aby           -0.12
Positive Logits
Token         Logit
shown         0.80
show          0.76
showed        0.69
show          0.68
shown         0.67
-show         0.64
Show          0.62
.show         0.61
SHOW          0.61
prove         0.60
Activations Density: 0.261%