INDEX
Explanations
phrases related to making choices or trade-offs
terms related to trade-offs and evaluations of value
Negative Logits
bered   -0.74
urses   -0.73
arus    -0.73
late    -0.71
liter   -0.69
miah    -0.68
bill    -0.68
attery  -0.65
bus     -0.65
estic   -0.65
Positive Logits
downside    0.92
why         0.86
Problem     0.83
why         0.83
WHY         0.79
weaknesses  0.78
lesson      0.78
Problem     0.76
drawback    0.76
takeaway    0.75
Activations Density: 0.531%