INDEX
Explanations
phrases indicating a comparison or evaluation based on certain criteria
Negative Logits
rouse     -0.80
enh       -0.74
Correct   -0.74
orem      -0.71
anasia    -0.70
okes      -0.69
oked      -0.69
arez      -0.69
terior    -0.68
ernal     -0.68
Positive Logits
how            0.98
recent         0.89
previous       0.70
recent         0.69
current        0.68
similarities   0.67
rumors         0.66
its            0.66
what           0.65
there          0.65
Activations Density: 0.099%