Explanations
harm and negative consequences
Negative Logits
letting     0.81
왤          0.70
太多        0.69
Too         0.69
Recent      0.69
too         0.67
Novel       0.66
Too         0.65
Uncommon    0.65
Recent      0.65
Positive Logits
ruined      1.42
decreased   1.36
reduced     1.29
reduced     1.23
increased   1.18
increased   1.17
erode       1.12
Reduced     1.11
destroyed   1.11
distorted   1.10
Activations Density 0.311%