INDEX
Explanations
phrases describing specific types or instances of items or concepts
phrases that introduce examples or instances
Negative Logits
ombat      -0.73
ribution   -0.72
antage     -0.66
dollar     -0.66
orem       -0.65
ushima     -0.63
emi        -0.63
iliate     -0.62
Cause      -0.60
Bore       -0.60
Positive Logits
ties          0.79
cond          0.74
things        0.70
Osw           0.66
odon          0.65
types         0.61
requ          0.61
embodiments   0.61
necess        0.60
prec          0.60
Activations Density: 0.030%