INDEX
Explanations
phrases expressing opinions or evaluations
phrases that reference comparable situations or events
Negative Logits
eteria      -0.81
ourse       -0.71
elin        -0.69
iband       -0.65
esson       -0.65
utenberg    -0.65
ells        -0.63
iven        -0.63
arate       -0.63
iets        -0.63
Positive Logits
lihood      1.32
ours        1.24
hers        1.05
yours       0.99
theirs      0.93
pires       0.81
liest       0.77
minded      0.73
lier        0.70
Deng        0.68
Activations Density 0.105%