INDEX
Explanations
phrases that contrast two different options
phrases indicating contrast or comparison
Negative Logits
enegger  -0.73
inho     -0.71
ells     -0.67
boys     -0.66
Chips    -0.64
Loaded   -0.63
avan     -0.63
liam     -0.63
obyl     -0.61
Bras     -0.61
Positive Logits
itably       0.87
necessarily  0.75
isons        0.69
opposed      0.69
âĶĢâĶĢâĶĢâĶĢ 0.69
untarily     0.68
materially   0.68
entimes      0.68
viously      0.67
willingly    0.67
Activations Density: 0.010%