Explanations
hallucinations and fabrications
Negative Logits
Token           Logit
cautioned       0.73
criticised      0.72
唏              0.70
avoided         0.69
evas            0.69
lame            0.68
controversial   0.67
criticized      0.66
sluggish        0.64
dismal          0.63
Positive Logits
Token           Logit
believing       2.46
belief          2.20
believe         2.15
believe         1.94
believes        1.89
Believe         1.88
belief          1.84
Believe         1.83
以為            1.78
Belief          1.75
Activations Density: 0.195%
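
The density figure can be read as the share of token positions on which the feature fires. The sketch below is one way to compute such a number, assuming density means "fraction of positions with activation above a threshold". The function name `activation_density`, the `threshold` parameter, and the toy data are hypothetical illustrations, not taken from the dashboard or its tooling.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds `threshold`.

    Assumption: "density" means the share of positions where the feature fires.
    """
    if not activations:
        raise ValueError("need at least one activation value")
    fired = sum(1 for a in activations if a > threshold)
    return fired / len(activations)


# Hypothetical toy data: 20,000 token positions, 39 of which are active.
# These values are made up so that the result matches the 0.195% shown above.
toy_activations = [1.0] * 39 + [0.0] * (20_000 - 39)

density = activation_density(toy_activations)
print(f"Activations Density: {density:.3%}")
```

A threshold of zero treats any positive activation as "firing"; a tool that applies a different cutoff would report a different density for the same feature.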