INDEX
Explanations
punctuation marks, specifically colons, in the text
punctuation marks or separators in the text
Negative Logits
behavi      -0.94
igue        -0.79
inement     -0.76
itol        -0.73
objects     -0.72
etheless    -0.72
ines        -0.69
tremend     -0.68
undai       -0.68
behav       -0.67
Positive Logits
TBD         0.87
Logged      0.81
Nom         0.67
76561       0.66
Yeah        0.66
ËĪ          0.66
fff         0.64
Who         0.64
Bye         0.63
Huh         0.63
Activations Density: 0.079%