INDEX
Explanations
phrases connecting one thing to a larger group or category, often emphasizing the diversity or quantity of examples
phrases indicating examples or instances
Negative Logits
same      -0.87
ioned     -0.75
alos      -0.70
cffff     -0.69
iol       -0.67
then      -0.66
nob       -0.65
awaru     -0.64
never     -0.64
shared    -0.64
Positive Logits
examples        1.07
example         1.01
scratching      1.01
sampling        0.98
iceberg         0.90
sample          0.90
illustration    0.86
symptom         0.86
glimpse         0.82
anecdotal       0.82
Activations Density: 0.129%