INDEX
Explanations
phrases or contexts indicating something is beneath a certain standard or level
Negative Logits
self      -0.24
ever      -0.23
time      -0.21
ology     -0.20
gether    -0.20
ical      -0.20
wide      -0.20
plier     -0.19
scopic    -0.19
icient    -0.19
Positive Logits
pin         0.35
whelming    0.33
lining      0.31
pins        0.30
lined       0.30
lay         0.29
whel        0.28
privileged  0.27
lies        0.26
util        0.26
Activations Density 0.016%