INDEX
Explanations
phrases indicating a level of importance or relevance towards specific topics or issues
instances of the word "concerned"
Negative Logits
artifacts   -0.85
Bom         -0.72
obs         -0.70
robe        -0.68
fruit       -0.68
arb         -0.67
buff        -0.67
ingen       -0.67
ingers      -0.66
guided      -0.65
Positive Logits
proble      0.73
citiz       0.72
trolling    0.70
atives      0.67
Schr        0.67
reon        0.67
Concern     0.67
cerned      0.66
NESS        0.66
ately       0.65
Activations Density: 0.023%