INDEX
Explanations
negations, particularly related to actions or behaviors that are not being done
instances of the word "not" or phrases indicating negation
Negative Logits
Expansion    -0.70
stakes       -0.67
Pros         -0.66
Films        -0.65
eers         -0.64
Companies    -0.64
Circuit      -0.62
Kry          -0.61
Basics       -0.61
Tours        -0.61
Positive Logits
icably        1.41
epad          1.19
icable        1.15
necessarily   1.09
hin           1.07
ched          0.97
orious        0.93
ifying        0.91
ices          0.89
necess        0.88
Activations Density 0.176%
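
The density figure above can be read as the share of tokens on which the feature fires at all. The sketch below illustrates that reading under an assumption: that "fires" means an activation strictly greater than zero. The function name, the threshold parameter, and the sample data are hypothetical and are not taken from the dashboard.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of tokens whose activation exceeds `threshold`.

    Assumption: a feature counts as active on a token when its activation
    is strictly greater than the threshold (0.0 by default).
    """
    if not activations:
        return 0.0
    fired = sum(1 for a in activations if a > threshold)
    return fired / len(activations)


# Hypothetical sample: 1,000 tokens, with the feature active on 2 of them.
sample = [0.0] * 998 + [1.3, 0.4]
density = activation_density(sample)
print(f"{density:.3%}")
```

On that reading, a density of 0.176% would correspond to the feature being active on roughly 1 token in every 570.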