INDEX
Explanations
references to filenames and file-related terminology
Negative Logits
e     -0.82
a     -0.81
st    -0.69
es    -0.68
us    -0.65
k     -0.65
o     -0.65
st    -0.61
ل     -0.58
i     -0.58
Positive Logits
filename     2.64
filename     2.59
Filename     2.11
FILENAME     1.90
Filename     1.81
filenames    1.78
FileName     1.60
filenames    1.49
文件名       1.47
fileName     1.38
Activations Density 0.065%