INDEX
Explanations
instruction or task prompts
Negative Logits
wonderful    0.46
Thats        0.44
olulu        0.42
thats        0.42
Awesome      0.41
mà           0.40
Bingo        0.39
Whats        0.39
wonderful    0.38
𝗸            0.38
Positive Logits
Given        0.56
Given        0.51
Candidate    0.47
Reading      0.46
Candidates   0.46
주어진       0.45
You          0.45
आपने         0.45
Combining    0.44
Provided     0.43
Activations Density 0.017%