INDEX
Explanations
adjectives related to processes or abstract concepts
critical phrases related to significant consequences or risks
Negative Logits
Token             Logit
;                 -0.75
,"                -0.72
%,                -0.70
.                 -0.69
!,                -0.69
,'                -0.68
.,                -0.68
.;                -0.68
%;                -0.67
.]                -0.66
Positive Logits
Token             Logit
etheless          0.90
efully            0.83
newcom            0.67
teasp             0.65
sequently         0.65
urther            0.65
ventus            0.64
DragonMagazine    0.61
lly               0.61
PsyNetMessage     0.61
Activations Density 0.952%