Explanations
model's conversational start
Negative Logits
entail          0.34
mz              0.32
transgress      0.32
aligning        0.32
кови            0.31
interconnect    0.31
convening       0.30
aligned         0.30
parenteral      0.30
mastering       0.30
Positive Logits
哈哈            0.41
Which           0.40
Yep             0.40
Haha            0.39
哈哈哈          0.37
Yep             0.36
איך             0.36
Friendly        0.36
That            0.35
esta            0.35
Activations Density: 0.035%