INDEX
Explanations
model answers
the token that marks the model/assistant’s turn in a chat transcript.
Negative Logits
Token           Logit
ESM             0.33
пользователя    0.33
्योर            0.31
रव              0.31
большой         0.31
úrov            0.31
большим         0.30
ロ              0.30
Cuánto          0.30
伡              0.30
Positive Logits
Token           Logit
answer          0.35
Sources         0.33
Wikipedia       0.33
Sources         0.31
answer          0.31
sources         0.31
Answer          0.30
Mutual          0.30
Guinness        0.29
Sadly           0.29
Activations Density 0.006%