INDEX
Explanations
negative adjectives starting with 'un-' followed by descriptive words
negative prefixes, particularly "un-", marking words associated with negativity or absence
Negative Logits
rows          -0.78
flats         -0.75
periphery     -0.72
racks         -0.71
bruises       -0.70
indoors       -0.70
chained       -0.70
separately    -0.70
stalls        -0.70
deletion      -0.69
Positive Logits
help           1.48
productive     1.42
balanced       1.38
important      1.35
interesting    1.32
professional   1.30
readable       1.27
original       1.27
inspired       1.26
ruly           1.26
Activations Density 0.029%
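
For readers unfamiliar with the density figure, the following is a minimal sketch of one common way such a number could be computed. It assumes that density means the fraction of sampled token positions on which the feature's activation is above zero. The function name `activation_density`, the `threshold` parameter, and the sample data are all hypothetical and do not come from the dashboard itself.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds `threshold`.

    Assumption: "density" is the share of sampled tokens on which the
    feature fires at all; the dashboard's exact definition is not stated.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: 10,000 token positions, 3 of which activate the feature.
sample = [0.0] * 9997 + [0.4, 1.2, 2.5]
density = activation_density(sample)
print(f"{density:.3%}")
```

Under this reading, a density of 0.029% would mean the feature fires on roughly 3 out of every 10,000 tokens, which is typical of a sparse, narrowly scoped feature.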