INDEX
Explanations
phrases about acting in someone's best interests or moral beliefs
interest, interests
Negative Logits
covers     -0.69
Covers     -0.67
cover      -0.63
cover      -0.62
COVER      -0.59
covers     -0.59
Covers     -0.59
COVER      -0.58
Cover      -0.54
Cover      -0.53
Positive Logits
interests   1.98
Interests   1.70
interest    1.69
interests   1.66
INTEREST    1.55
Interest    1.50
interest    1.47
Interest    1.41
Interests   1.32
INTEREST    1.28
Activations Density: 0.726%