HTML Cleaner & Markup Sanitizer — Documentation
Privacy first.
We do not store your content or your documents. Files are processed in memory and discarded immediately after the cleaned output is returned to your browser.
What is HTML cleaning?
HTML cleaning is the process of removing presentational, proprietary, and unsafe markup from a fragment of HTML — typically because the markup came from a tool (Word, Google Docs, an old WYSIWYG editor) that injected a lot of styling and ID-attribute noise on top of the actual content. A cleaner output is easier to embed in a CMS, paste into an email template, or feed into a Markdown converter.
Why Word creates messy HTML
Microsoft Word treats HTML as an export format for visual fidelity, not for content portability. When you "Save as HTML" or copy and paste Word content into a web editor, you get:
- Inline style attributes on nearly every tag (margins, font-family, font-size, color).
- Word-specific class names (MsoNormal, MsoListParagraph) that no CSS in your site will ever style.
- Microsoft conditional comments (<!--[if mso]>...<![endif]-->) that target old versions of Outlook.
- Namespaced XML tags (<o:p>, <w:document>) and xmlns declarations.
- Empty <span> wrappers and empty <p> blocks left behind by tracked-change history.
None of this affects how the content looks once you re-render it cleanly in HTML. Stripping it shrinks the document, makes it portable, and removes incidental browser-quirks Word relied on.
Mode presets — when to use which
The "Mode preset" dropdown picks a curated set of cleaning options for a specific destination. The advanced checkboxes below the dropdown let you tweak any individual switch if a preset is close to what you want but not quite. "Custom" mode honours every checkbox literally — no preset overrides.
Custom
Honours every checkbox as-is. Pick this when you want fine-grained control and you already know which switches you need on or off. The default checkbox state strips inline styles + classes + empty tags + spans + comments + MSO noise, which is a reasonable middle ground for most pasted Word content.
Best for: one-off cleanups where you want to see exactly what is being touched.
Minimal Clean
The lightest preset. Strips only inline style attributes, HTML comments, and Microsoft Office noise (MSO conditional comments, Word's namespaced tags, mso-* CSS declarations). Classes, IDs, empty tags, and <span> wrappers are preserved.
Best for: cleaning Word/Google Docs output that you intend to paste straight into a CMS that already styles its own classes (e.g. you have a design system whose CSS expects <p class="lead"> and you want to keep author-applied classes).
Aggressive Clean
Strips everything that's typically Word/editor noise: inline styles, classes, IDs, empty tags, <span> wrappers, comments, MSO noise, and then minifies the whitespace. Tracking parameters in href values are NOT stripped (use the dedicated checkbox or pick Email-Safe / Security-Harden).
Best for: producing a tiny, semantic HTML payload from a heavily-formatted source. Output is portable but compressed to one line; switch on "Auto-format (indent)" instead of Minify if you want human-readable indentation.
Email-Safe Mode
Strips classes, IDs, comments, MSO conditional blocks, and click-tracking URL parameters, but KEEPS inline style attributes (since email clients ignore external CSS and many don't honor <style> blocks). The output is minified to reduce the chance of a mail client rewriting whitespace into rendering artifacts.
Best for: preparing newsletter, transactional, or marketing email HTML. Tracking params get stripped because spam filters dislike them and they often duplicate the same tracking the sending platform adds anyway.
Framework-Ready
Aggressive scrub PLUS class and ID strip, then minified. Output retains only the semantic tag structure — paragraphs, headings, lists, links, emphasis — with no styling hooks attached.
Best for: pasting into a CSS framework like Tailwind / Bootstrap / utility-first systems where you'll add framework classes after the fact, and you don't want the inherited author classes to fight your framework.
WordPress Clean
Strips inline styles, classes, IDs, empty tags, spans, comments, and MSO noise. Output is NOT minified — the block editor parses cleanest from readable HTML. Preserves all semantic tags so Gutenberg can recognize paragraphs, headings, and lists as their respective blocks.
Best for: pasting cleaned content into the WordPress block editor (Gutenberg) where the source already has correct <h2>, <p>, <ul> markup. If your source is flat unstructured Word text and you want auto-detection of headings + lists, also check "Auto-format HTML structure (WordPress / Gutenberg)" in the advanced options.
Accessibility
Strips inline styles and classes (presentational noise that overrides user / AT-defined preferences) but KEEPS IDs — they're routinely referenced by aria-labelledby, aria-describedby, and skip-link targets. Preserves all semantic tags and empty tags (in case of an aria-live region with conditional content). Output is not minified, which keeps it diffable for accessibility QA.
Best for: public-sector or compliance-focused publishing where the rendered output must respect user contrast / font-size preferences and where ARIA wiring must survive the cleanup.
Security-Harden
The most aggressive preset: strips every attribute (keeping only href / src / alt for semantic correctness), all classes and IDs, all MSO noise, all tracking parameters from URLs, and minifies the result. Inline styles and <span> wrappers go too. Output is the bare minimum HTML that conveys structure and links.
Best for: accepting content from untrusted third parties (user-generated content moderation, vendor-supplied marketing copy), then re-styling everything on your side. Pair with the link-behavior toggles if external links will appear in the output.
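As a rough illustration of how a preset relates to the checkboxes, the sketch below models two of the presets above as flag tables. The flag names and the resolve_options helper are hypothetical (the tool's real option keys are not documented here), and it assumes that a preset's values win for the scrub flags it defines while every other checkbox passes through untouched.

```python
# Hypothetical flag names; only the on/off values come from the preset
# descriptions above.
PRESETS = {
    "minimal": {
        "remove_styles": True, "remove_comments": True, "remove_mso": True,
        "remove_classes": False, "remove_ids": False,
        "remove_empty_tags": False, "remove_spans": False,
    },
    "wordpress": {
        "remove_styles": True, "remove_classes": True, "remove_ids": True,
        "remove_empty_tags": True, "remove_spans": True,
        "remove_comments": True, "remove_mso": True,
    },
}


def resolve_options(mode, checkboxes):
    """Return the effective flags for one cleaning run."""
    if mode == "custom":
        # Custom honours every checkbox literally.
        return dict(checkboxes)
    effective = dict(checkboxes)
    # Assumption: preset values override the scrub flags they define.
    effective.update(PRESETS[mode])
    return effective
```

For example, resolve_options("minimal", {"remove_classes": True, "remove_tracking": True}) keeps classes (the preset wins) but still strips tracking parameters, because the Minimal table does not mention that flag.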
Option reference (advanced checkboxes)
Every option below applies in any mode. When a preset is selected, the preset's defaults overlay your checkbox choices for the standard scrub flags; explicit toggles for link behavior and formatting (auto-format / minify) always carry your per-run choice.
- Remove inline styles
- Drops every style="…" attribute. Turn this off if you're preparing HTML for email, where inline styles are the only styling that many clients reliably honor.
- Remove classes
- Drops every class="…" attribute. Useful when pasting into a different design system; harmful when your destination CMS relies on specific classes.
- Remove IDs
- Drops every id="…" attribute. Disable for Accessibility mode if your downstream tooling references IDs via aria-labelledby / skip-link anchors.
- Remove empty tags
- Drops tags whose children are all whitespace. Void elements (img, br, hr, input) are preserved regardless because they're inherently childless.
- Collapse extra whitespace
- Three normalizations in one pass: (1) runs of 2+ ASCII spaces inside text nodes collapse to a single space (preserved inside <pre> and <code> where whitespace is significant); (2) leading and trailing <br> tags get trimmed from block-level elements (including <br> wrapped in inline formatting like <strong>); (3) blocks like <p><strong><br></strong><strong><br></strong>…</p> — which have inline children around line breaks but no visible text — are dropped entirely. Catches the Word-paste artifact where line breaks get re-wrapped in <strong> for no semantic reason. Enabled by default in every preset except Minimal.
- Remove <span> wrappers
- Unwraps <span>: keeps the children, drops the wrapper. Word inserts many empty or styling-only spans; removing them shrinks the output significantly without losing content.
- Remove comments
- Drops <!-- … --> blocks AFTER the parser has loaded the document — comment nodes are removed from the DOM as a separate pass. Use this for normal, well-formed comments.
- Hide comments (pre-parse)
- Strips <!-- … --> blocks and everything inside them via regex BEFORE the HTML parser runs. Stronger than "Remove comments": comment contents (including malformed or Word-conditional markup) can never leak into the parsed tree. Use this when comments contain HTML-like content that confuses the standard cleanup pass — e.g. Outlook's <!--[if mso]>…<![endif]--> conditionals carrying Word-only fallback markup.
- Remove MSO / Word noise
- Strips Microsoft Office artifacts: <!--[if mso]> conditional comments, <o:p> and <w:*> namespaced tags, xmlns declarations, mso-* CSS, and Word's MsoNormal class family. Highly recommended for any Word/Outlook-pasted content.
- Remove ALL attributes
- Strips every attribute except href / src / alt. The most aggressive switch short of "Strip ALL HTML"; used by the Security-Harden preset.
- Remove tracking params
- Strips utm_source, utm_medium, utm_campaign, fbclid, gclid, msclkid, mc_cid, HubSpot's _hsenc / _hsmi, and ~20 other common click-tracking parameters from href values. Used by Email-Safe (spam filters dislike them) and Security-Harden.
- External links: nofollow + new tab
- Adds rel="nofollow noopener noreferrer" + target="_blank" to every <a> whose host differs from your site's domain. Existing rel tokens are preserved and merged. See the "External and internal link behavior" section below for details on how internal/external is determined.
- Internal PDFs: open in new tab
- Adds target="_blank" + rel="noopener noreferrer" to same-site links whose URL path ends in .pdf. nofollow is NOT added — internal PDFs don't bleed link equity off the site. Catches relative paths ("/docs/file.pdf") as well as absolute same-domain URLs.
- Auto-format HTML structure (WordPress / Gutenberg)
- Activates a heuristic post-process that runs AFTER the standard scrub. Promotes short title-case lines to <h2>, demotes follow-up subsections to <h3>/<h4>, wraps "?"-terminated questions as <h3> FAQ headings, collapses bullet/numbered marker runs into <ul>/<ol>, and re-wraps loose text in <p>. Stackable with every preset and with Custom — you can combine it with Hide comments, the link toggles, anything else. Sub-options (heading sensitivity, start at H2, allow H1, FAQ format, list detection, smart-quote normalization) appear directly below the advanced grid when this is checked. See the dedicated section below for the full rubric.
- Strip ALL HTML (plain text)
- Returns plain text only. Overrides every other option — use when you want just the words, no markup. Scripts and styles are stripped first so their bodies don't leak through as text.
- Minify output
- Collapses whitespace between tags and runs of internal whitespace into single spaces. Cannot be combined with "Auto-format (indent)"; checking one automatically unchecks the other.
- Auto-format (indent)
- Indents output with tabs and newlines so block-level structure is easy to scan. Inline elements (<a>, <strong>, <em>, <code>) stay on one line so emphasis doesn't break. Cannot be combined with Minify.
- Preserve semantic structure
- Reserved switch — currently always-on internally. Documented for forward compatibility in case a future mode opts to flatten heading hierarchy.
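To make the "Remove tracking params" option above more concrete, here is a minimal sketch of query-string filtering using Python's standard urllib.parse module. The function name strip_tracking is hypothetical, and the parameter set contains only the names listed above, not the tool's full list.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Only the parameters named in the option description; the real list is longer.
TRACKING_PARAMS = {
    "utm_source", "utm_medium", "utm_campaign",
    "fbclid", "gclid", "msclkid", "mc_cid", "_hsenc", "_hsmi",
}


def strip_tracking(url):
    """Drop known click-tracking parameters, keep everything else."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in TRACKING_PARAMS
    ]
    # urlencode re-encodes the surviving pairs, so unusual escaping in the
    # original query may be normalized.
    return urlunsplit(parts._replace(query=urlencode(kept)))
```

Path, fragment, and non-tracking parameters survive, so "https://example.com/page?id=7&utm_source=news#top" becomes "https://example.com/page?id=7#top".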
External and internal link behavior
The cleaner can rewrite anchor tags as part of the cleanup pass. Two independent toggles control this: "External links: nofollow + new tab" and "Internal PDFs: open in new tab".
How internal vs external is determined
When either link-behavior toggle is checked, a "Site domain" field appears in the advanced options panel. It is auto-detected from this WordPress site's home URL (so for this installation it's already filled in) and is editable in case you are cleaning content destined for a different site. The cleaner extracts the host from each <a href>, lowercases it, drops a leading "www.", and compares it against your domain:
- Same host → internal.
- Different host → external.
- Relative URL with no host (e.g. "/about" or "docs/file.pdf") → internal.
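- mailto: / tel: / sms: / javascript: → never touched by the link-behavior toggles (javascript: URLs are still neutralized by the security pass that runs in every mode).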
- Pure-anchor links ("#section") → skipped.
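The rules above can be sketched as a small classifier. This is an illustration under stated assumptions, not the plugin's actual code: the names classify_link and normalize_host are hypothetical, and it applies the same lowercase and "www." normalization to the Site domain as to the link host.

```python
from urllib.parse import urlsplit

SKIPPED_SCHEMES = {"mailto", "tel", "sms", "javascript"}


def normalize_host(host):
    """Lowercase the host and drop a leading 'www.'."""
    host = host.lower()
    return host[4:] if host.startswith("www.") else host


def classify_link(href, site_domain):
    """Return 'internal', 'external', or 'skip' for one href value."""
    href = href.strip()
    if not href or href.startswith("#"):
        return "skip"                      # pure-anchor links
    parts = urlsplit(href)
    if parts.scheme in SKIPPED_SCHEMES:    # urlsplit lowercases the scheme
        return "skip"
    if not parts.hostname:
        return "internal"                  # relative URL with no host
    same = normalize_host(parts.hostname) == normalize_host(site_domain)
    return "internal" if same else "external"
```

With a Site domain of "example.com", "https://www.Example.com/about" and "/about" both classify as internal, while "https://other.org/" is external.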
Privacy
The Site domain you enter is NEVER stored. It is consumed inline during the cleanup pass and dropped when the response is built. No analytics row, no log line, no setting carries the domain value forward. The auto-detected default is computed from this WordPress site's home_url() at render time — it isn't kept either.
External links: nofollow + new tab
Sets rel="nofollow noopener noreferrer" and target="_blank" on every external link. The three rel tokens travel together intentionally:
- nofollow — tells search engines not to follow the link for ranking purposes. Standard hygiene for author-supplied outbound links in user-generated content, blog articles, and quoted material.
- noopener — severs the target page's reference to your tab via window.opener. Without this, target="_blank" enables "reverse tabnabbing" — the linked site can navigate your tab to a phishing page.
- noreferrer — omits the Referer header on the navigation so the target site can't see which of your pages linked to it. Often used together with noopener for parity across older browsers.
Existing rel tokens are preserved — if a link already has rel="sponsored", the result is rel="sponsored nofollow noopener noreferrer". target is set to _blank unconditionally for external links.
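The merge behavior described above could look roughly like the following. The helper name merge_rel is hypothetical; the sketch assumes existing tokens keep their original order and that required tokens already present are not duplicated.

```python
def merge_rel(existing, required=("nofollow", "noopener", "noreferrer")):
    """Append any missing required rel tokens after the existing ones."""
    tokens = (existing or "").split()
    seen = {token.lower() for token in tokens}
    for token in required:
        if token not in seen:
            tokens.append(token)
            seen.add(token)
    return " ".join(tokens)
```

merge_rel("sponsored") returns "sponsored nofollow noopener noreferrer", matching the example in the text, and a link with no rel attribute at all simply receives the three required tokens.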
Internal PDFs: open in new tab
Same-site links whose URL path ends in ".pdf" (case-insensitive) get target="_blank" + rel="noopener noreferrer". nofollow is intentionally NOT added because the PDF is hosted on your own domain, so there's no SEO concern. Detection covers absolute same-domain URLs ("https://example.com/docs/file.pdf"), root-relative paths ("/docs/file.pdf"), and protocol-relative URLs ("//example.com/docs/file.pdf").
Why this exists: PDFs often render in the browser's built-in viewer or trigger a download dialog. Either way, opening them in a new tab preserves the user's place in your article.
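A minimal sketch of the PDF check follows. The function pdf_link_attrs is hypothetical, and it assumes the internal/external decision has already been made elsewhere and is passed in as a boolean.

```python
from urllib.parse import urlsplit


def pdf_link_attrs(href, is_internal):
    """Return the attributes to add for an internal PDF link, else {}."""
    # Only the URL path is inspected, so a query string or fragment after
    # ".pdf" does not hide the extension.
    path = urlsplit(href).path
    if is_internal and path.lower().endswith(".pdf"):
        return {"target": "_blank", "rel": "noopener noreferrer"}
    return {}
```

Because the match is on the path and is case-insensitive, "/docs/Report.PDF?v=2" qualifies, while an external PDF link is left to the external-link toggle.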
Auto-format (pretty-print) vs Minify
These two options are mutually exclusive — they do opposite things. Picking one in the UI auto-clears the other.
- Auto-format (indent) — walks the cleaned DOM and re-emits with tab indentation + newlines at block boundaries. Block-level tags (<p>, <h2>, <ul>, <section>, <article>, <header>, etc.) start on their own indented line. Inline tags (<a>, <strong>, <em>, <code>, <span>, <time>, <small>) stay on the same line as their surrounding text so emphasis and links don't fragment. Void tags (<br>, <img>, <hr>) emit as self-closing without a partner. Pick this when you want output you can hand off to another developer or read in a diff tool.
- Minify — collapses all whitespace between tags into nothing and all internal whitespace into single spaces. Output is one long line. Pick this when you're embedding into a page where bytes matter (email payload size, inline shortcode content) or when downstream tooling is whitespace-sensitive.
If neither is selected, the cleaner emits the natural output of the DOM serializer — usually one long line because the parse step doesn't preserve source whitespace.
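For a feel of what Minify does, here is a deliberately naive regex sketch. It is not the tool's implementation: a production minifier would need to leave <pre> and <code> content alone and be careful about whitespace between inline tags, which this version removes unconditionally.

```python
import re


def minify(html):
    """Naive minifier: no special handling for <pre>, <code>, or inline tags."""
    html = re.sub(r">\s+<", "><", html)   # whitespace between tags -> nothing
    html = re.sub(r"\s+", " ", html)      # remaining runs -> single space
    return html.strip()
```

Given an indented two-item list such as "<ul>\n  <li>One   two</li>\n  <li>Three</li>\n</ul>\n", the result is the single line "<ul><li>One two</li><li>Three</li></ul>".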
What are MSO tags?
MSO is shorthand for "Microsoft Office." The MSO tag family covers any tag, attribute, comment, or style declaration that exists to round-trip a document back to Word. This includes:
- <!--[if mso]> conditional comment blocks, which Outlook honors but every other client ignores.
- class="MsoNormal", the base paragraph style Word stamps on every paragraph.
- mso-margin-top-alt: … and other CSS declarations whose names start with mso-, meaningful only to Word's renderer.
- <o:p>, <w:document>, <st1:place> and other namespaced tags from Office namespaces.
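The four artifact families above can be targeted with a handful of regular expressions. The sketch below is an assumption-laden illustration (the function name strip_mso_noise is hypothetical, and a DOM-based cleaner would be more robust than regexes); it unwraps namespaced tags rather than deleting their text, and only handles a class attribute that holds a single Mso* name.

```python
import re


def strip_mso_noise(html):
    """Regex-only sketch; a real cleaner would work on the parsed DOM."""
    # Conditional comment blocks: <!--[if mso]> ... <![endif]-->
    html = re.sub(r"<!--\[if[^\]]*\]>.*?<!\[endif\]-->", "", html,
                  flags=re.S | re.I)
    # Namespaced tags such as <o:p> or </w:document>: drop the tag, keep text.
    html = re.sub(r"</?[A-Za-z][A-Za-z0-9]*:[^>]*>", "", html)
    # xmlns and xmlns:prefix declarations.
    html = re.sub(r'\s+xmlns(:\w+)?="[^"]*"', "", html)
    # mso-* declarations (this would also touch matching visible text).
    html = re.sub(r"mso-[a-z-]+\s*:[^;\"']*;?\s*", "", html, flags=re.I)
    # class attributes holding a single Mso* name.
    html = re.sub(r'\s+class="?Mso\w+"?', "", html)
    return html
```

Applied to a typical Word paragraph, <p class="MsoNormal" style="mso-margin-top-alt:auto;color:red">Hi<o:p></o:p></p> reduces to <p style="color:red">Hi</p>.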
Auto-format HTML structure (WordPress / Gutenberg)
A standalone checkbox in the advanced cleaning options. When checked, after the safety scrub finishes, a heuristic post-process pass re-emits the content as Gutenberg-safe HTML — paragraphs wrapped in <p>, bullet runs collapsed into <ul> / <ol>, and likely headings promoted to <h2> / <h3>. Because it lives alongside the other advanced toggles, you can pair it with any preset (Minimal, WordPress Clean, Security-Harden, etc.) or with Custom, and combine it with Hide comments / the link-behavior toggles / anything else.
This is helpful when you start from a flat .docx or a Google Docs export where headings only "look like headings" (bold, larger font) without semantic markup behind them.
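As an illustration of the list-collapsing part of this pass, the sketch below groups consecutive marker lines into <ul> / <ol> and wraps everything else in <p>. The function name lines_to_blocks and the exact marker patterns are assumptions, and heading promotion is left out entirely.

```python
import re
from html import escape

BULLET = re.compile(r"^\s*[•\-*]\s+(.*)$")
NUMBERED = re.compile(r"^\s*\d+[.)]\s+(.*)$")


def lines_to_blocks(lines):
    """Group marker runs into lists; wrap other non-empty lines in <p>."""
    out, items, kind = [], [], None

    def flush():
        nonlocal items, kind
        if items:
            body = "".join(f"<li>{escape(text)}</li>" for text in items)
            out.append(f"<{kind}>{body}</{kind}>")
        items, kind = [], None

    for line in lines:
        bullet, numbered = BULLET.match(line), NUMBERED.match(line)
        new_kind = "ul" if bullet else "ol" if numbered else None
        if new_kind is None:
            flush()                       # a non-marker line ends any open list
            if line.strip():
                out.append(f"<p>{escape(line.strip())}</p>")
            continue
        if kind != new_kind:
            flush()                       # switching ul <-> ol starts a new list
            kind = new_kind
        items.append((bullet or numbered).group(1).strip())
    flush()
    return out
```

Text is escaped on output, so the sketch emits only <p>, <ul>, <ol>, and <li> tags regardless of what the input lines contain.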
How heading detection works
Each block of cleaned text is scored by a small set of heuristics: short length, title-case capitalization, question-mark or colon endings, and whether the entire visible content was bold in the source. Long blocks (more than 120 characters) are never promoted to headings regardless of signal strength. By default, the first level emitted is H2 — your WordPress theme already provides the H1.
You can tune sensitivity (Low / Medium / High) per run. Low is conservative — only obvious headings get promoted. High catches more candidates at the cost of occasional false positives. Medium is the default.
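The scoring idea can be sketched as follows. Only the four signals and the 120-character ceiling come from the description above; the point values, the 60-character "short" cut-off, the 60% title-case ratio, and the per-sensitivity thresholds are hypothetical placeholders chosen for illustration.

```python
MAX_HEADING_CHARS = 120   # from the text: longer blocks are never promoted
SHORT_CHARS = 60          # hypothetical cut-off for "short"
THRESHOLDS = {"low": 3, "medium": 2, "high": 1}   # hypothetical


def is_title_case(text):
    """True when most alphabetic words start with a capital letter."""
    words = [w for w in text.split() if w[0].isalpha()]
    if not words:
        return False
    capped = sum(1 for w in words if w[0].isupper())
    return capped / len(words) >= 0.6   # hypothetical ratio


def heading_score(text, was_bold=False):
    """Count how many heading signals a block of visible text shows."""
    text = text.strip()
    if not text or len(text) > MAX_HEADING_CHARS:
        return 0
    score = 0
    if len(text) <= SHORT_CHARS:
        score += 1
    if is_title_case(text):
        score += 1
    if text.endswith(("?", ":")):
        score += 1
    if was_bold:
        score += 1
    return score


def is_heading(text, sensitivity="medium", was_bold=False):
    return heading_score(text, was_bold) >= THRESHOLDS[sensitivity]
```

Under these placeholder numbers, "Why Word Creates Messy HTML" scores 2 (short and title-case), so it is promoted at Medium but not at Low, while any block over 120 characters scores 0 at every sensitivity.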
Safety boundary
The Auto-Structure pass runs ONLY after the standard cleaner has stripped scripts, style blocks, event handlers, javascript: URLs, and inline styles. It reads cleaned visible text — never raw Word formatting signals — so it can never reintroduce unsafe markup. The pass produces only the following tag families: <p>, <h2>, <h3>, <h4>, <ul>, <ol>, <li>, <strong>, <em>, <a>.
Email compatibility
Email clients do not honor external stylesheets and many do not honor <style> blocks. The "Email-Safe Mode" preset keeps inline style attributes but removes classes, IDs, MSO conditionals, comments, and tracking parameters in href values. Output is ready to drop into an email template without further changes.
Security hardening
Every cleaning pass strips active code regardless of options:
- <script>, <iframe>, <object>, <embed>, <form>, <input>, <button>, <link>, <meta>, <base> — removed.
- on*= event handler attributes (onclick, onmouseover, ...) — removed.
- href / src values starting with javascript:, vbscript:, data:text/html, data:image/svg+xml, file: — javascript / vbscript href values are rewritten to # (preserving the anchor text); src values are removed entirely.
The Security-Harden preset additionally removes every non-essential attribute (keeping only href / src / alt for semantic correctness), every class and ID, and every tracking parameter from URLs.
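The always-on attribute rules can be illustrated with a small filter over one element's attributes. This is a sketch, not the tool's code: sanitize_attrs is a hypothetical name, and because the text does not say what happens to href values using data: or file:, the sketch assumes they are dropped.

```python
DANGEROUS_PREFIXES = (
    "javascript:", "vbscript:", "data:text/html", "data:image/svg+xml", "file:",
)


def sanitize_attrs(attrs):
    """Return a copy of an element's attributes with active content removed."""
    clean = {}
    for name, value in attrs.items():
        lname = name.lower()
        if lname.startswith("on"):
            continue                       # onclick, onmouseover, ...
        if lname in ("href", "src"):
            # Strip whitespace and case so "Java Script:" tricks still match.
            probe = "".join(value.split()).lower()
            if probe.startswith(DANGEROUS_PREFIXES):
                if lname == "href" and probe.startswith(("javascript:", "vbscript:")):
                    clean[name] = "#"      # link text stays, target is inert
                # Assumption: every other dangerous href/src is dropped.
                continue
        clean[name] = value
    return clean
```

An anchor with href="JavaScript:alert(1)" and an onclick handler comes out with href="#" and no handler; an image whose src is a javascript: URL loses the src attribute altogether.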
Accessibility
The Accessibility preset preserves semantic tags (headings, lists, paragraphs) and IDs (used by aria-labelledby) while stripping inline styles and class names that override user / assistive-technology preferences. The output respects user-defined font sizes, contrast modes, and screen-reader cues.
File upload privacy guarantee
When you upload a file:
- The browser uploads it over HTTPS to a temporary location PHP manages.
- Our processor opens the file, reads the bytes into memory, and immediately unlink()s the temporary file from disk.
- The bytes are cleaned in memory and the result is returned in the same HTTP response.
- When the response is sent, the in-memory copy is released. Nothing is written to /uploads, nothing is written to the database except a metadata row (mode preset, file extension, byte counts, IP, timestamp) used solely to surface abuse.
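The read-then-delete step in the list above is done with PHP's unlink() in the tool itself. The Python analogue below only illustrates the ordering (read into memory, remove the temp file, then work on the bytes); read_and_discard is a hypothetical name.

```python
import os
import tempfile


def read_and_discard(tmp_path):
    """Read the upload into memory, then delete the temp file right away."""
    with open(tmp_path, "rb") as handle:
        data = handle.read()
    os.unlink(tmp_path)        # nothing is left on disk after this line
    return data


# Usage: simulate an uploaded temp file, then process it from memory only.
tmp = tempfile.NamedTemporaryFile(delete=False)
tmp.write(b"<p>hello</p>")
tmp.close()
payload = read_and_discard(tmp.name)
```

After the call, payload holds the bytes and the temporary path no longer exists, so any later failure in the cleaning step cannot leave the document behind on disk.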
Frequently asked questions
- Do you store my document?
- No. Uploaded files are processed in memory and the temp file is deleted before the response is built. We have no other storage path for document bytes.
- Is my paste saved anywhere?
- No. Pasted HTML never reaches disk and is never written to the database. It lives in PHP memory for the duration of the request only.
- What file types are supported?
- .docx (modern Word), .rtf, .html / .htm, .txt. Legacy .doc files are not supported — please save as .docx in Word first.
- Can this break my formatting?
- Yes, intentionally. A cleaner removes presentational markup. Pick a less-aggressive preset (Minimal Clean) if you want to keep more of the original styling. Use the diff viewer to see exactly what changed.
- Is it safe to paste content from confidential documents?
- The tool itself does not store or log any of your content. However, the bytes do traverse the network to reach our server. If you would not be comfortable emailing the same content over HTTPS, you should not paste it here either. Treat this like any other HTTPS-connected webform.
- What if my .docx is password-protected?
- The extractor detects encrypted .docx files (the document body is encrypted, not stored at word/document.xml) and returns a clear error. Decrypt the file in Word first, then re-upload.
- What is the maximum input size?
- Configured per-site by the administrator. Defaults: 1 MB pasted input, 4 MB uploaded document. Both ceilings are enforced server-side after the request reaches us, so they cannot be bypassed from the browser.
- Will "Auto-format HTML structure (WordPress / Gutenberg)" change my content?
- Yes — that's the point. It re-emits flat text as semantic HTML (paragraphs, headings, lists). Visual content is preserved; only the markup changes. Long blocks are never promoted to headings, so paragraphs stay paragraphs. Use the diff viewer to see exactly what changed before copying the output.
- Why does Auto-Structure default to starting at H2?
- Most WordPress themes render the post title as H1. Adding another H1 inside the body would create two top-level headings on the page — which breaks the document outline and confuses screen readers and search engines. The "Allow H1" toggle is available if your theme does not auto-emit one.
- What do the Auto-Structure sub-options do?
- Start at H2: When ON (default), the highest heading level promoted is H2. When OFF, text lines can be promoted to H1. Recommended ON for WordPress posts (where the post title is already H1).
- Allow H1: Only appears when "Start at H2" is OFF. Allows H1 headings to be promoted from title-case lines. Enable only if your theme does not emit a post title as H1.
- FAQ format: When ON, lines ending with "?" are promoted to H3 headings (as FAQ-style questions). When OFF, questions remain as regular paragraphs.
- List detection: When ON, sequences of bullet markers (•, -, *, etc.) and numbered markers are recognized and wrapped as <ul> or <ol>. When OFF, marker lines remain as plain text.
- Smart quotes: When ON, straight ASCII quotes (" and ') are normalized to typographic curly quotes (“ ” and ‘ ’). When OFF, quotes remain straight.
- Heading sensitivity: Controls how aggressive the heuristic is when detecting title-case lines as headings. Lower values (0–2) require stricter title case and recognize fewer headings. Higher values (3–5) are more lenient and recognize more headings from loose capitalization.
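Returning to the password-protected .docx question above: an unencrypted .docx is a ZIP package with its body at word/document.xml, whereas an encrypted one is stored in a different container and has no such entry. A minimal check along those lines, with the hypothetical name docx_body_available, might look like this.

```python
import io
import zipfile


def docx_body_available(data):
    """True if the bytes are a ZIP package containing word/document.xml."""
    buffer = io.BytesIO(data)
    if not zipfile.is_zipfile(buffer):
        return False          # e.g. an encrypted file, which is not a plain ZIP
    with zipfile.ZipFile(buffer) as package:
        return "word/document.xml" in package.namelist()
```

A False result is the point at which a cleaner could return the "decrypt the file in Word first" error instead of attempting extraction.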