Explanations

terms related to self-regulation

Negative Logits

 dwind        -0.73
 cowardly     -0.73
 coward       -0.67
 pacif        -0.67
 abstinence   -0.67
 disbanded    -0.66
 nonviolent   -0.66
 nobility     -0.66
 brave        -0.65
 humane       -0.64

Positive Logits

address       0.82
description   0.82
ident         0.78
rating        0.77
reports       0.74
rated         0.73
Rating        0.73
selection     0.72
eval          0.72
icer          0.71

Activations Density: 0.018%