INDEX
Explanations
phrases or contexts that convey curiosity or remarkability
Negative Logits
uts          -0.74
arest        -0.73
oise         -0.72
avers        -0.71
aper         -0.71
required     -0.71
ussy         -0.70
uter         -0.68
otent        -0.67
reditation   -0.66
Positive Logits
tid           0.98
Flavoring     0.85
sidel         0.83
twists        0.82
anecdotes     0.82
insights      0.82
trivia        0.82
arios         0.81
ness          0.78
observations  0.76
Activations Density: 0.026%