Explanations
informative statements or explanations
verbs and phrases indicating the provision of information or summaries
Negative Logits
Token          Logit
Initialized    -0.77
azo            -0.76
doms           -0.72
mare           -0.68
assault        -0.68
psc            -0.66
LINE           -0.64
amphetamine    -0.64
AME            -0.64
ahu            -0.63
Positive Logits
Token          Logit
examples       1.24
insights       1.16
insight        1.12
explanations   1.11
detailed       1.09
links          1.07
insightful     1.03
concise        1.03
pointers       1.03
descriptions   1.02
Activations Density 0.246%