Unsafe HTML filtering using regular expressions

ID

swift.unsafe_html_filtering

Severity

high

Resource

Incorrect validation

Language

Swift

Tags

CWE:116, CWE:185, CWE:186, CWE:20, NIST.SP.800-53, OWASP:2021:A3, PCI-DSS:6.5.7

Description

Matching HTML tags using regular expressions is difficult to do correctly and can lead to security issues such as cross-site scripting (XSS) vulnerabilities. While it is technically possible to match some simple HTML tags using regular expressions, doing so comprehensively is essentially impossible due to HTML’s complex parsing rules.

Browsers are extremely forgiving when parsing HTML, allowing many variations and edge cases that regular expressions cannot handle correctly. This makes it easy for attackers to craft input that bypasses regex-based HTML filters, leading to potential security vulnerabilities.

Rationale

The following is an example of vulnerable code that attempts to filter out <script> tags using a regular expression:

import Foundation

let script_tag_regex = try! NSRegularExpression(pattern: "<script[^>]*>.*</script>", options: .caseInsensitive)

var html = userInput
var old_html = ""

while html != old_html {
  old_html = html
  html = script_tag_regex.stringByReplacingMatches(
    in: html,
    options: [],
    // NSRange is measured in UTF-16 code units, so html.count (Characters) can fall short
    range: NSRange(html.startIndex..., in: html),
    withTemplate: ""
  )
}

This code attempts to remove all <script> tags from user input by repeatedly applying a regular expression replacement. However, this approach has several problems:

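  1. Bypass through variations: Browsers accept forms that the pattern does not match. For example, the end tag </script > (whitespace before the closing bracket) is valid to a browser but not to the regex, and because . does not match line breaks by default, a <script> element whose body spans several lines is never removed.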

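  2. Nested tags: A single replacement pass can reassemble a tag from its fragments (for example, <scr<script></script>ipt> becomes <script>). Repeating the replacement until the string stops changing guards against this, but it cannot help with tags that the pattern never matches in the first place.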

  3. Incomplete coverage: The regex only targets <script> tags and does not handle other XSS vectors such as <img onerror=…>, <iframe>, or inline event handlers on other elements.

  4. Parsing complexity: HTML has complex rules for attributes, comments, CDATA sections, and entity encoding that regex cannot properly handle.
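The bypasses described above can be reproduced with a few sample inputs. The following is a minimal sketch, not a test suite: stripScriptTags is a hypothetical helper that wraps the same pattern and loop as the vulnerable example, and the sample payloads are illustrative assumptions.

```swift
import Foundation

// Same pattern and loop as the vulnerable example, wrapped in a
// hypothetical helper so it can be called on sample inputs.
func stripScriptTags(_ input: String) -> String {
    let regex = try! NSRegularExpression(
        pattern: "<script[^>]*>.*</script>",
        options: .caseInsensitive
    )
    var html = input
    var previous = ""
    while html != previous {
        previous = html
        html = regex.stringByReplacingMatches(
            in: html,
            options: [],
            range: NSRange(html.startIndex..., in: html),
            withTemplate: ""
        )
    }
    return html
}

let samples = [
    "<script>alert(1)</script>",      // textbook case: removed
    "<script>alert(1)</script >",     // space before '>' in the end tag
    "<script>\nalert(1)\n</script>",  // '.' does not match line breaks by default
    "<img src=x onerror=alert(1)>"    // no script tag involved at all
]

for sample in samples {
    let filtered = stripScriptTags(sample)
    let verdict = filtered.isEmpty ? "removed" : "passed through"
    print("\(verdict): \(sample.debugDescription)")
}
```

Only the first sample is removed; the other three pass through the filter unchanged. Adding more patterns for each case tends to uncover further variations, which is why the remediation below does not rely on pattern matching.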

Another example using Swift’s string replacement with regex:

let html = userInput.replacingOccurrences(
  of: "<[^>]*>",
  with: "",
  options: .regularExpression
)

This attempts to strip all HTML tags but fails on edge cases such as:

  - <tag attr=">" /> (angle bracket inside an attribute value)
  - <!-- <tag> --> (tags inside comments)
  - Malformed HTML that browsers still parse
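To make these failures concrete, here is a small sketch that runs the same replacement on a few inputs. The helper name stripTags and the sample strings are assumptions made for illustration; the pattern is the one shown above.

```swift
import Foundation

// The tag-stripping call from the example, wrapped in a hypothetical helper.
func stripTags(_ input: String) -> String {
    return input.replacingOccurrences(
        of: "<[^>]*>",
        with: "",
        options: .regularExpression
    )
}

// Well-formed markup is stripped as intended.
print(stripTags("<b>bold</b>"))              // bold

// The '>' inside the attribute value ends the match early, so part of the
// tag is left behind in the output.
print(stripTags("<a title=\">\">link</a>"))  // ">link

// An unterminated tag contains no '>' and is therefore not matched at all.
let unterminated = "<img src=x onerror=alert(1) "
let page = "<p>" + stripTags(unterminated) + "</p>"
print(page)                                  // <p><img src=x onerror=alert(1) </p>
```

In the last case the filter returns the input untouched. Once that output is embedded in a page, the '>' of the following </p> can end up closing the <img> tag when a browser parses it, so the onerror handler may still run.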

Remediation

Instead of using regular expressions to filter HTML, use a well-tested HTML sanitization library. These libraries understand HTML parsing rules and can properly handle edge cases and malicious input.

For Swift/iOS applications, consider using:

  1. HTML parsing libraries: Use proper HTML parsers like SwiftSoup or HTMLKit that understand HTML structure.

  2. Web content filtering: For WebKit/WKWebView, use Content Security Policy (CSP) headers and proper configuration.

  3. Native string escaping: When user input is meant to be displayed as text, escape HTML metacharacters instead of trying to strip tags, and avoid rendering raw HTML.

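Example of restricting script execution in a WKWebView with a Content Security Policy. This is defense in depth for rendered content, not a replacement for sanitizing or escaping untrusted input: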

import WebKit

// Instead of regex filtering, use WKWebView with proper CSP
let config = WKWebViewConfiguration()
let userContentController = WKUserContentController()

// Set Content Security Policy
let cspScript = """
(function() {
    var meta = document.createElement('meta');
    meta.httpEquiv = 'Content-Security-Policy';
    meta.content = "default-src 'self'; script-src 'none'";
    document.head.appendChild(meta);
})();
"""

userContentController.addUserScript(
    WKUserScript(source: cspScript,
                 injectionTime: .atDocumentStart,
                 forMainFrameOnly: true)
)

config.userContentController = userContentController
let webView = WKWebView(frame: .zero, configuration: config)

When user input should be displayed as text rather than interpreted as markup, escape the HTML metacharacters:

// Instead of regex filtering, properly encode HTML entities
func escapeHTML(_ text: String) -> String {
    return text
        .replacingOccurrences(of: "&", with: "&amp;")
        .replacingOccurrences(of: "<", with: "&lt;")
        .replacingOccurrences(of: ">", with: "&gt;")
        .replacingOccurrences(of: "\"", with: "&quot;")
        .replacingOccurrences(of: "'", with: "&#x27;")
}

// Use the escaped text in your HTML
let safeHTML = "<p>\(escapeHTML(userInput))</p>"
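The chained version above depends on replacing & first; otherwise the ampersands introduced by the later replacements would be escaped a second time. As an alternative sketch, the same mapping can be applied in a single pass so that ordering does not matter. The name escapeHTMLSinglePass and the sample payload are assumptions, not part of the original example.

```swift
import Foundation

// Hypothetical single-pass variant of escapeHTML: every character is examined
// exactly once, so the result does not depend on the order of replacements.
func escapeHTMLSinglePass(_ text: String) -> String {
    var result = ""
    result.reserveCapacity(text.utf8.count)
    for character in text {
        switch character {
        case "&": result += "&amp;"
        case "<": result += "&lt;"
        case ">": result += "&gt;"
        case "\"": result += "&quot;"
        case "'": result += "&#x27;"
        default: result.append(character)
        }
    }
    return result
}

// Illustrative payload: after escaping, it is rendered as literal text.
let payload = "<img src=x onerror=\"alert('xss')\">"
let escaped = escapeHTMLSinglePass(payload)
print(escaped)
// &lt;img src=x onerror=&quot;alert(&#x27;xss&#x27;)&quot;&gt;
```

Escaping of this kind is appropriate for element content and quoted attribute values. It is not sufficient on its own for other contexts such as URLs, inline scripts, or style blocks.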

Configuration

This detector does not need any configuration.

References