Explanations
phrases indicating the concept of "opposites" or contrasting ideas
Negative Logits
Token         Logit
lac           -0.16
adipiscing    -0.16
IDD           -0.16
lings         -0.15
lin           -0.15
self          -0.14
istics        -0.14
ling          -0.14
lis           -0.14
iri           -0.14
Positive Logits
Token         Logit
-sex          0.20
/op           0.19
extremes      0.18
extreme       0.17
veau          0.17
nhau          0.17
effect        0.17
direction     0.16
.Toolkit      0.16
-direction    0.15
Activations Density: 0.021%