Explanations
direct statements and acknowledgements
Negative Logits
!!        1.01
!         0.97
!!!!!     0.91
!!!       0.91
!...      0.88
!         0.86
!!!!      0.86
!         0.84
等等      0.83
!!        0.82
Positive Logits
sobering      0.98
oterapia      0.98
bienvenida    0.96
regrett       0.95
مني           0.95
welcome       0.94
preferable    0.94
讵            0.91
hardly        0.90
pretty        0.88
Activations Density: 0.057%