INDEX
Explanations
words related to contrasting actions or concepts
phrases indicating a distinction between individual actions and societal influences
Negative Logits
WATCHED       -0.77
mun           -0.69
ļéĨĴ          -0.68
uel           -0.60
Status        -0.59
Ͻ             -0.59
Flavoring     -0.58
tenance       -0.57
Strongh       -0.57
periodically  -0.56
Positive Logits
necessarily   0.83
ones          0.80
nor           0.80
slightest     0.76
mention       0.70
anything      0.68
anymore       0.66
anywhere      0.65
YP            0.65
zes           0.63
Activations Density: 0.161%