The objective of this mechanism is to improve the quality of your content extraction.
Free trial | Custom Legacy | Basic | Business | Unlimited
❌ | ✅ | ✅ | ✅ | ✅
The content extractor allows Semji to efficiently integrate your articles directly into the Semji editor. We analyze your site's pages to extract the editorial content.
Once your content has been retrieved correctly, you can work efficiently on your texts in Semji.
Semji performs this extraction automatically, but retrieval errors can occur. Your technical teams can now customize the extraction using specific attributes that exclude or include parts of your content.
How do these attributes work?
These attributes are inserted directly into your HTML tags.
To include elements, you need to specify the attribute: data-content-include
<div data-content-include> children </div>
To exclude elements, set the attribute: data-content-exclude
<div data-content-exclude> children </div>
Good to know: To ensure that these attributes work properly, they should be placed as close as possible to the element (HTML tag) you wish to include or exclude, so that the rule takes precedence over any other inclusion or exclusion rules you set up yourself.
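The placement rule above can be read as "the attribute closest to the element decides". The following Python sketch models that reading for a single element. It is an illustration under stated assumptions, not Semji's actual implementation: the function name `resolve` and the idea of passing the attribute names found along the path from the outermost ancestor to the element are hypothetical.

```python
from typing import List, Optional, Set

INCLUDE = "data-content-include"
EXCLUDE = "data-content-exclude"


def resolve(path_attrs: List[Set[str]]) -> Optional[str]:
    """Return "include", "exclude", or None for one element.

    path_attrs holds the attribute names of every element on the path
    from the outermost ancestor down to the element itself (hypothetical
    input format, chosen only for this sketch).
    """
    # Start at the element and walk outwards: the closest attribute wins.
    for attrs in reversed(path_attrs):
        if EXCLUDE in attrs:
            # Checked first, so an element carrying both attributes is excluded.
            return "exclude"
        if INCLUDE in attrs:
            return "include"
    # No attribute on the path: the extractor's default behaviour applies.
    return None


# <div data-content-include><p data-content-exclude>...</p></div>
print(resolve([{INCLUDE}, {EXCLUDE}]))  # exclude
# <div data-content-exclude><h2 data-content-include>...</h2></div>
print(resolve([{EXCLUDE}, {INCLUDE}]))  # include
```

An element without any attribute inherits the decision of its nearest marked ancestor, which is why an attribute placed on a distant wrapper can be overridden by one placed closer to the element.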
Some examples of inclusion and exclusion rules:
I want to exclude an item:
<div data-content-exclude>
Content to exclude
</div>
I would like to include an item:
<div data-content-include>
Content to include
</div>
I want to exclude an item that was previously included:
<div data-content-include>
<p> This paragraph </p>
<p data-content-exclude> The paragraph I wish to exclude.</p>
</div>
The second <p> element (the one carrying data-content-exclude) will be excluded.
I would like to include an item that was previously excluded:
<div data-content-exclude>
<p>Paragraph that will be excluded </p>
<h2 data-content-include> The heading I want to keep.</h2>
</div>
The <h2> element will be retained.
Good to know: If an element contains both attributes (data-content-include and data-content-exclude), the exclusion takes precedence and the content is left out of the extraction.
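Because exclusion wins on an element that carries both attributes, such an element is usually worth reviewing. The sketch below shows one way a technical team might spot these cases in a page template before publishing. It relies only on Python's standard `html.parser` module; the class `ConflictFinder` and the function `find_conflicts` are hypothetical helpers written for this example and are not part of Semji.

```python
from html.parser import HTMLParser

INCLUDE = "data-content-include"
EXCLUDE = "data-content-exclude"


class ConflictFinder(HTMLParser):
    """Collects start tags that carry both attributes (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.conflicts = []  # (tag name, (line, column)) pairs

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names arrive lowercased.
        names = {name for name, _ in attrs}
        if INCLUDE in names and EXCLUDE in names:
            self.conflicts.append((tag, self.getpos()))


def find_conflicts(html):
    """Return every element in the markup that has both attributes."""
    finder = ConflictFinder()
    finder.feed(html)
    finder.close()
    return finder.conflicts


page = """<div data-content-include>
<p data-content-include data-content-exclude>Ambiguous paragraph</p>
</div>"""

for tag, (line, column) in find_conflicts(page):
    print(f"<{tag}> on line {line} has both attributes and will be excluded")
```

Removing one of the two attributes from each reported element makes the intended rule explicit.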
Your technical teams can now work autonomously to improve your content extractor.