Explanations
listing specific examples or categories
Negative Logits
Token        Logit
some         1.08
both         1.00
各种         0.97
those        0.97
some         0.94
Some         0.93
Both         0.91
something    0.90
Some         0.90
några        0.87
Positive Logits
Token        Logit
people       1.03
instances    0.99
of           0.88
important    0.87
things       0.87
aspects      0.87
notable      0.79
factors      0.79
places       0.78
other        0.78
Activations Density 0.406%