INDEX
Explanations
contractions, specifically instances of "didn't", which activate strongly
negative contractions and phrases that express negation
Negative Logits
Token       Logit
planet      -0.75
amer        -0.71
rall        -0.64
accompan    -0.64
bard        -0.63
stre        -0.62
Britann     -0.62
antine      -0.62
rog         -0.61
Reviewer    -0.60
Positive Logits
Token        Logit
necessarily  1.03
exactly      1.02
gonna        0.95
quite        0.84
urtles       0.82
gotta        0.81
kidding      0.78
really       0.77
bother       0.76
even         0.75
Activations Density: 0.069%