Explanations
explaining about abstract concepts
Negative Logits
Token       Logit
Which       0.65
which       0.54
duquel      0.51
Which       0.49
Others      0.48
suivantes   0.47
которой     0.47
Unknown     0.47
יים         0.47
只有一个    0.47
Positive Logits
Token          Logit
about          0.76
remediation    0.71
teamwork       0.70
how            0.68
storytelling   0.68
vandalism      0.66
tantamount     0.64
heartwarming   0.64
akin           0.64
indicative     0.63
Activations Density: 0.632%