INDEX
Explanations
terms related to romance and romantic relationships
Negative Logits
936        -0.16
auty       -0.16
inson      -0.15
manship    -0.15
wards      -0.15
isters     -0.15
rogen      -0.15
alet       -0.15
onders     -0.14
ulty       -0.14

Positive Logits
ized       0.22
izing      0.20
ised       0.19
atic       0.16
ism        0.16
ting       0.16
ize        0.15
ising      0.15
ous        0.15
ization    0.15
Activations Density 0.014%