INDEX
Explanations
phrases with contrasting elements or terms
Negative Logits
Token       Logit
oun         -0.54
anwhile     -0.53
prus        -0.52
kefeller    -0.52
Vaugh       -0.51
namely      -0.50
nomine      -0.49
nodd        -0.48
"},"        -0.47
tiss        -0.45
Positive Logits
Token       Logit
,           1.24
,,          1.10
*,          0.98
,...        0.97
.,          0.95
?,          0.92
!,          0.90
,.          0.86
%,          0.83
,           0.82
Activations Density 1.228%