INDEX
Explanations
expressions and references to romantic themes and relationships
Negative Logits
upa        -0.18
auty       -0.18
isters     -0.17
onia       -0.16
itoris     -0.16
ernals     -0.15
itarian    -0.15
umb        -0.15
manship    -0.14
nie        -0.14
Positive Logits
ized       0.18
izing      0.18
Rom        0.17
ting       0.17
ism        0.17
rom        0.16
atic       0.16
_rom       0.16
notions    0.16
_callable  0.16
Activations Density 0.019%