INDEX
Explanations
negations combined with specific locations or contexts
phrases indicating exceptions or limitations
NEGATIVE LOGITS
Token     Logit
omas      -0.68
icides    -0.67
bath      -0.66
shown     -0.65
ihar      -0.62
icide     -0.61
inea      -0.61
inus      -0.59
ao        -0.59
ubi       -0.58
POSITIVE LOGITS
Token     Logit
vous      0.66
ecast     0.63
hap       0.62
Admir     0.61
atican    0.59
ones      0.58
former    0.58
owitz     0.58
wives     0.57
part      0.56
Activations Density 0.062%
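
As a rough illustration of how a density figure like the one above could be obtained, the sketch below assumes that activation density means the fraction of sampled tokens on which the feature's activation is greater than zero. That definition, the function name activation_density, and the sample data are all assumptions for illustration; the dashboard does not state how the value is computed.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of tokens whose activation exceeds `threshold`.

    Assumed definition: a token counts as "active" when its activation
    is strictly greater than the threshold (0.0 by default).
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: 100,000 tokens, 62 of which fire the feature.
sample = [0.0] * 100_000
for i in range(62):
    sample[i * 1000] = 1.0

density = activation_density(sample)
print(f"{density:.3%}")
```

Under this assumed definition, a density of 0.062% corresponds to the feature firing on about 62 of every 100,000 tokens.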