INDEX
Explanations
words that introduce examples or instances
phrases that introduce examples or instances
Negative Logits
ements    -0.68
alities   -0.67
orts      -0.65
atures    -0.64
lic       -0.60
orously   -0.59
fect      -0.59
forms     -0.57
unlaw     -0.57
Exit      -0.57

Positive Logits
,,        0.68
mith      0.65
ðĿ        0.63
.,        0.63
ignt      0.62
owing     0.62
,.        0.60
liking    0.59
âĸ        0.58
onto      0.58
Activations Density: 0.029%