[css-syntax] custom property names too permissive, require namespacing rules #7129

aphillips · 2022-03-10T17:32:08Z

Originally raised on CSS Variables, but later discussion concluded the best fix is to change CSS Syntax. Original post below:

https://www.w3.org/TR/css-variables-1/#defining-variables

A custom property is any property whose name starts with two dashes (U+002D HYPHEN-MINUS), like --foo. The <custom-property-name> production corresponds to this: it’s defined as any <dashed-ident> (a valid identifier that starts with two dashes), except -- itself, which is reserved for future use by CSS. Custom properties are solely for use by authors and users; CSS will never give them a meaning beyond what is presented here.

The above text defines the custom property name as "any valid identifier". Tracing that definition back to CSS Values and thence to ident token, we find that the name can contain any Unicode code point ≥ U+0080. This includes various special forms of whitespace as well as potential problem characters, such as bidi controls (such as might cause "Trojan Source" attacks). Namespacing is definitely a complicated problem: the I18N WG doesn't want groups to cherry-pick characters (thereby excluding certain languages from using the feature).

Most programming languages attempt to address this by adopting some form of restriction for variable names such as those found in UAX31 Unicode Identifier and Pattern Syntax. In JavaScript, for example, the definition looks like the one found here. CSS should make similar restrictions on property names (values can remain unrestricted).

tabatkins · 2022-03-10T18:10:42Z

Yes, custom property names can contain literally any non-ASCII character.

If necessary, I'm happy to restrict this, but I question why it would be necessary to do this for property names, but okay for property values to still be fully unrestricted?

aphillips · 2022-03-10T18:35:25Z

Property names are used in CSS "code" and have to be parsed, matched, and otherwise referenced. Abusive names can cause spoofing problems (even though the underlying code point sequence is still just some integers and the parser may not care). For example, is --\0301 a variable reference? Or an error? (using U+0301 COMBINING ACUTE ACCENT as an example of a combining mark at the start of a name)

Property values are data and can include natural language text (as well as, well, any character data, including junk). While the value space might be limited by applications in different ways, there don't appear to be any requirements to do so here. In fact, your Spec goes out of its way to highlight this fact:

Because custom properties can contain anything, there is no general way to know how to interpret what’s inside of them (until they’re substituted into a known property with var()). Rather than have them partially resolve in some cases but not others, they’re left completely unresolved; they’re a bare stream of CSS tokens interspersed with var() functions.

tabatkins · 2022-03-10T18:44:51Z

Property values are used in CSS just as much as property names, tho. We can't interpret custom property values in the custom property, but we do as soon as they're substituted into a non-custom property, or if the custom property is registered with a grammar.

Moreso, tho, the <custom-ident> value type, which operates on the same rules, is used in a number of places in CSS, such as counter-style names, font names, etc. which seem to be in a similar semantic space.

aphillips · 2022-03-10T18:58:11Z

Yes, it occurred to me that this might turn out to be a gap in CSS Syntax (which might be a serious "ouch" and difficult to do something about).

Since one of the things the property value can contain is a string literal, one probably can't apply UAX31-like rules just generically to the value (i.e. in CSS Variables)

tabatkins · 2022-03-10T19:03:14Z

Right, removing the potential from strings def seems out (at minimum, they def shouldn't be restricted to the Identifier production from Unicode), but I think I didn't make my original point clearly enough - this restriction should apply to all custom identifiers, not just property names, right?

aphillips · 2022-03-10T19:10:21Z

That's right.

tabatkins · 2022-03-10T19:16:49Z

Okay, I'm gonna retag this to Syntax, then, because we should handle the restriction at that level.

tabatkins · 2022-03-10T19:29:07Z

Agenda+ to discuss restricting the allowed codepoints in an ident sequence (used in keywords, function names, dimension units, selectors, property names, etc).

Possible options:

- use the UAX31 categories, matching JS identifiers
- use the post-ASCII part of HTML's custom element name restrictions, to ensure that selectors can match all custom element names without needing escaping

Then there are subsequent questions. First, what should we do about characters so restricted that are used in an identifier sequence?

- Treat them like we do lone surrogates, and replace with U+FFFD
- Disallow them from the production at all, so an identifier valid today with a restricted character in the middle would instead become two identifiers with a DELIM token containing the restricted character. (I'm inclined toward this, as it would cause most usage of the restricted character to become invalid at parse-time, such as in custom property names, and thus would discourage its usage.)

Second, should we allow escapes to represent the restricted character in identifiers?

- JS, as far as I can tell, doesn't allow it. Their restriction is rather broad, tho - it's intended to make it so that no identifier that would be illegal to write literally can be written with escapes (in other words, they disallow an ident that could only be written by using escapes) - this disallows things like escaping a dash or period, which CSS has historically allowed and probably can't restrict
- Just let it work. I'm inclined to go this way.
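
To make the "just let it work" option concrete, here is a minimal Python sketch of how a CSS backslash escape turns into a code point, so that a restricted character can still appear in an identifier when written as an escape. The function name `decode_escapes` is hypothetical, and the sketch assumes the input has already gone through CSS input preprocessing (newlines normalized). It is not a full tokenizer: it does not reject a backslash followed by a newline, which CSS does not treat as a valid escape.

```python
HEX_DIGITS = set("0123456789abcdefABCDEF")
WHITESPACE = set(" \t\n")  # assumption: newlines were normalized beforehand


def decode_escapes(text: str) -> str:
    """Decode CSS backslash escapes in an identifier-like string (sketch)."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch != "\\":
            out.append(ch)
            i += 1
            continue
        i += 1  # skip the backslash
        if i >= len(text):
            out.append("\ufffd")  # escape cut off by end of input
            break
        if text[i] in HEX_DIGITS:
            start = i
            # An escape holds at most six hex digits.
            while i < len(text) and i - start < 6 and text[i] in HEX_DIGITS:
                i += 1
            value = int(text[start:i], 16)
            if i < len(text) and text[i] in WHITESPACE:
                i += 1  # a single whitespace character ends the escape
            if value == 0 or 0xD800 <= value <= 0xDFFF or value > 0x10FFFF:
                out.append("\ufffd")  # zero, surrogates, out-of-range values
            else:
                out.append(chr(value))
        else:
            out.append(text[i])  # e.g. "\." stands for a literal "."
            i += 1
    return "".join(out)


print(decode_escapes(r"--a\B0 b"))  # the name contains U+00B0 DEGREE SIGN
```

Under this option, `--a\B0 b` stays a single identifier whose name contains the degree sign, while the same name written with a literal degree sign is affected by whatever restriction is chosen.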

svgeesus · 2022-03-15T19:02:27Z

Since this is now being solved at the CSS Syntax level, which is the correct way to do it, untagging from CSS Variables

tabatkins · 2022-03-16T16:14:01Z

Note that the latter two of the three identifiers in my popular tweet would be invalid under the JS rules:

.ಠ_ಠ { --（╯°□°）╯: ︵┻━┻; } is valid CSS.

If I'm reading correctly, under the HTML rules the middle one --（╯°□°）╯ would be invalid due to the degree symbols, but the other two are valid.

Not that this is required to be supported, just noting the effects. ^_^

css-meeting-bot · 2022-03-16T16:37:08Z

The CSS Working Group just discussed Custom property names too permissive, and agreed to the following:

RESOLVED: Use HTML restrictions for custom idents
RESOLVED: illegal characters in an ident can be escaped
RESOLVED: Invalid ident characters are treated as DELIM tokens

The full IRC log of that discussion

Topic: Custom property names too permissive
github: https://github.com//issues/7129
TabAtkins: i18nWG raised issue about custom idents, which allow any Unicode codepoint above a certain codepoint
TabAtkins: There are some concerns about e.g. bidi characters corrupting the display of the code
TabAtkins: Also argument for consistency in what characters allowed across languages
TabAtkins: JS follows UAX?? rules for characters allowed in idents
TabAtkins: HTML allows a different but largely compatible range of characters
TabAtkins: In one of my Tweets, I showed off using weird Unicode rules
TabAtkins: e.g. different emoji are valid or invalid
TabAtkins: I agree with i18n feedback, reasonable to partially restrict these
TabAtkins: e.g. no reason to allow bidi override chars in CSS idents
TabAtkins: so I suggest adopting either HTML rules or JS rules
q?
TabAtkins: don't have a strong opinion on which to go for
TabAtkins: Otherwise I'd go with HTML rules by default
Scribenick: emilio
fantasai: I think this is fairly reasonable, but I don't know the differences between the rules so I don't have an opinion on those yet
TabAtkins: JS rules are a bit more strict, they disallow chars that look like punctuation
TabAtkins: HTML gives exact codepoint ranges
TabAtkins: Reason I'd go with HTML is to guarantee being able to write selectors for custom elements, without ever having to escape
ack fantasai
fantasai: That sounds reasonable, let's go with that
Rossen_: Makes sense, any downsides to it?
TabAtkins: Any change to make more restrictive, could potentially make some stylesheets invalid
TabAtkins: potentially breaking code that works
Rossen_: And with HTML rules we'd have fewer breakage
Rossen_: seems like path of least destruction
Rossen_: Anyone would like to argue against the change entirely?
Rossen_: If not any objections?
Rossen_: Taking the silence as a no
RESOLVED: Use HTML restrictions for custom idents
TabAtkins: Got 2 sub-issues
TabAtkins: One is whether to allow illegal characters to be escaped in an identifier
TabAtkins: JS doesn't allow that, you can escape for readability but not to avoid the identifier restrictions
TabAtkins: but CSS has traditionally always allowed escapes for everything, so don't see a strong reason to disallow
+1 from us too
TabAtkins: So I would prefer to go with illegal chars can be escaped
fantasai: I strongly agree with that
Rossen_: Any objections for allowing illegal characters to be escaped in an ident?
RESOLVED: illegal characters in an ident can be escaped
TabAtkins: Next question is how do we handle the illegal characters
That doesn't allow nulls in idents, does it?
TabAtkins: Do we censor them into e.g. U+FFFD
TabAtkins: or drop them entirely?
TabAtkins: I'd prefer to drop them, because it would more clearly result in invalid code
TabAtkins: so if we allow to work but censored it wouldn't prevent use in source text, which was the goal of i18n
TabAtkins: so would prefer to exclude from the ident production
+1
+1 TabAtkins
Rossen_: [missed]
TabAtkins: No, would not be changing existing rules for censoring rules. Currently lone surrogates etc. do that
TabAtkins: Those are in there for UTF-8 well-formedness and C compatibility
TabAtkins: They have a reason to be censored out at technical low level
TabAtkins: these restrictions are for human reasons, so would restrict differently
ack fantasai
fantasai: So should we resolve that they would make the production invalid? (That's what was proposed right?)
--（╯°□°）╯
TabAtkins: yes
TabAtkins: if you put this ^ as a custom property name, the degree sign is not a valid character
TabAtkins: so it would make an ident, a delim, a parenthesis, and a ???
TabAtkins: That's definitely not an ident, because it's multiple tokens not an ident token
Is there a practical use case for doing something like that? Seems more like a developer having fun rather than good quality code.
TabAtkins: Proposed resolution is that it would break into multiple tokens
fantasai: What kind of token are these invalid characters going to be?
TabAtkins: DELIMs, one codepoint at a time
TabAtkins: Characters without a specific role are generally handled as DELIM
TabAtkins: and we only use certain DELIMs in certain places
the degree sign isn't a valid ident char under the HTML rules, so this would produce an ident, a delim containing the degree sign, an ident, a delim, and finally an ident
RESOLVED: Invalid ident characters are treated as DELIM tokens
present-

tabatkins · 2022-03-16T16:41:47Z

The HTML allowed chars are:

"-" | "." | [0-9] | "_" | [a-z] | #xB7 | [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x37D] | [#x37F-#x1FFF] | [#x200C-#x200D] | [#x203F-#x2040] | [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]

We'd continue to disallow . and allow [A-Z], of course, but for all the characters >= 0x80 we'll match this list.
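
A rough Python sketch of how the list above combines with the DELIM resolution. The names `NON_ASCII_IDENT_RANGES`, `is_ident_code_point` and `split_ident_run` are made up for illustration. The sketch deliberately ignores the ident-start rules, escapes, whitespace, and characters such as parentheses that a real CSS tokenizer turns into their own token types, so it only shows the splitting behaviour.

```python
# Non-ASCII ranges copied from the list above; bounds are inclusive.
NON_ASCII_IDENT_RANGES = [
    (0xB7, 0xB7), (0xC0, 0xD6), (0xD8, 0xF6), (0xF8, 0x37D),
    (0x37F, 0x1FFF), (0x200C, 0x200D), (0x203F, 0x2040),
    (0x2070, 0x218F), (0x2C00, 0x2FEF), (0x3001, 0xD7FF),
    (0xF900, 0xFDCF), (0xFDF0, 0xFFFD), (0x10000, 0xEFFFF),
]


def is_ident_code_point(ch: str) -> bool:
    """True if ch may appear unescaped inside an identifier (sketch)."""
    cp = ord(ch)
    if cp < 0x80:
        # ASCII part: letters (both cases), digits, "-" and "_"; "." stays out.
        return ch.isalnum() or ch in "-_"
    return any(lo <= cp <= hi for lo, hi in NON_ASCII_IDENT_RANGES)


def split_ident_run(text: str) -> list[tuple[str, str]]:
    """Split text into ("ident", ...) runs and one ("delim", ...) per bad char."""
    tokens = []
    current = ""
    for ch in text:
        if is_ident_code_point(ch):
            current += ch
            continue
        if current:
            tokens.append(("ident", current))
            current = ""
        tokens.append(("delim", ch))  # DELIMs hold one code point each
    if current:
        tokens.append(("ident", current))
    return tokens


print(split_ident_run("--a\u00b0b"))
# [('ident', '--a'), ('delim', '°'), ('ident', 'b')]
```

Because `--a°b` no longer forms a single identifier, a declaration using it as a custom property name would fail to parse, which is the discouraging effect the resolution was after.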

svgeesus · 2022-03-16T16:45:30Z

@aphillips does that resolution address your concerns?

aphillips · 2022-03-16T17:15:55Z

Programming languages such as JS and Java that allow non-ASCII variable names with character limits usually have different restrictions for the initial character. Most notably they forbid combining marks. They sometimes exclude other values (such as bidi controls, although those are excluded above). I think using the HTML restrictions is a realistic solution for CSS for the reasons given above.

We might need a note about combining mark handling at the start of a token. HTML handles this by requiring an alpha char ([a-zA-Z] in your case) at the start. HTML also treats combining characters as non-combining when parsing, e.g. class="&#x301;" contains a class name consisting solely of a combining mark (I used an entity for visibility). As long as the processor can't be fooled, I think you're good to go?

tabatkins · 2022-03-16T17:32:24Z

Ah, hm, indeed. HTML gets to avoid the first-character problem. CSS does have special rules for the first character of an ident as well, but they're only different in the ASCII range (preventing idents from being number-lookalikes).

But if the concern is just about combining characters at the start, that'll be fine mechanically; the CSS parser still just handles codepoints individually and doesn't care about combining in any way. So you could select the class you mention without worry, by putting that combining char after a period.

That said, I'm fine with further first-char restrictions if necessary. Unless you request otherwise tho, @aphillips , I'll assume that the existing rules should be fine and use the same non-ASCII allowed characters for both initial and non-initial chars.
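
The difference between what the parser sees and what a human reader sees can be sketched in Python with the standard `unicodedata` module. The property names and helper functions below are hypothetical, and using NFC equality as a stand-in for "renders alike" is a simplification, not a definition of spoofing.

```python
import unicodedata

COMBINING_ACUTE = "\u0301"  # U+0301 COMBINING ACUTE ACCENT

precomposed = "--caf\u00e9"              # "é" as the single code point U+00E9
decomposed = "--cafe" + COMBINING_ACUTE  # "e" followed by U+0301


def code_points(text: str) -> list[str]:
    """List the code points the way a code-point-based parser sees them."""
    return [f"U+{ord(ch):04X}" for ch in text]


def same_for_parser(a: str, b: str) -> bool:
    """Code-point-by-code-point comparison, with no normalization."""
    return a == b


def same_after_nfc(a: str, b: str) -> bool:
    """Comparison after NFC normalization, a rough proxy for 'renders alike'."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(unicodedata.category(COMBINING_ACUTE))     # Mn (nonspacing mark)
print(code_points(decomposed)[-2:])              # ['U+0065', 'U+0301']
print(same_for_parser(precomposed, decomposed))  # False
print(same_after_nfc(precomposed, decomposed))   # True
```

Two names that differ for the parser but look the same on screen are exactly the case the author-facing note would need to warn about.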

aphillips · 2022-03-16T18:13:55Z

As long as the CSS parser doesn't care, then things should work. Content authors will need to be advised about the spoof/abuse potential when viewing CSS files as text somewhere (you may even already have such a note??)

tabatkins · 2022-03-16T18:24:41Z

Don't have such a note, but I'll look over some of the verbiage used elsewhere for that issue and add one.

tabatkins · 2022-03-23T17:57:42Z

Okay, restriction applied, and I added a significant note to that section.

I'll need to add and/or tweak some tests for this.

mathiasbynens · 2022-03-28T12:03:08Z

CC’ing @markusicu @macchiati (Unicode “Trojan Source” working group) as FYI

tabatkins · 2022-03-29T19:23:46Z

Note that I'm currently linking to a Rust-lang blog post about the trojan source thing; if there's a better "official source" about it from Unicode I'd be happy to switch the reference over.

aphillips · 2022-03-29T21:39:18Z

Here's Unicode's announcement, which has a link to @macchiati et al's doc about the topic.

faceless2 · 2022-04-22T11:28:40Z

Re testcases, we've just patched this in and are seeing changes with some old tests:

css/CSS2/i18n/syndata/character-encoding-001.xht
css/CSS2/i18n/syndata/character-encoding-004.xht
css/CSS2/i18n/syndata/character-encoding-007.xht
css/CSS2/i18n/syndata/character-encoding-010.xht
css/CSS2/i18n/syndata/character-encoding-026.xht
css/CSS2/i18n/syndata/character-encoding-027.xht
css/CSS2/i18n/syndata/character-encoding-028.xht
css/CSS2/i18n/syndata/character-encoding-029.xht
css/CSS2/syntax/characters-0080-009F-001.xht

cdoublev · 2022-06-07T12:46:04Z

Shouldn't serialize an identifier be modified accordingly?

- If the character is not handled by one of the above rules and is greater than or equal to U+0080, is "-" (U+002D) or "_" (U+005F), or is in one of the ranges [0-9] (U+0030 to U+0039), [A-Z] (U+0041 to U+005A), or \[a-z] (U+0061 to U+007A), then the character itself.
+ If the character is not handled by one of the above rules and is a [non-ASCII ident code point](https://drafts.csswg.org/css-syntax-3/#non-ascii-ident-code-point), "-" (U+002D), "_" (U+005F), or is in one of the ranges [0-9] (U+0030 to U+0039), [A-Z] (U+0041 to U+005A), or \[a-z] (U+0061 to U+007A), then the character itself.
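
A sketch of what the proposed wording would mean for serialization, written in Python. It follows the CSSOM "serialize an identifier" steps with the single clause swapped as in the diff above; the function names are hypothetical, the range table is the one posted earlier in this thread, and the thread does not show whether CSSOM adopted exactly this text. One consequence, assuming the remaining rules stay as they are, is that a non-ASCII character outside the list falls through to the final "escaped character" rule.

```python
# Ranges as posted earlier in the thread; bounds are inclusive.
NON_ASCII_IDENT_RANGES = [
    (0xB7, 0xB7), (0xC0, 0xD6), (0xD8, 0xF6), (0xF8, 0x37D),
    (0x37F, 0x1FFF), (0x200C, 0x200D), (0x203F, 0x2040),
    (0x2070, 0x218F), (0x2C00, 0x2FEF), (0x3001, 0xD7FF),
    (0xF900, 0xFDCF), (0xFDF0, 0xFFFD), (0x10000, 0xEFFFF),
]


def is_non_ascii_ident(cp: int) -> bool:
    return any(lo <= cp <= hi for lo, hi in NON_ASCII_IDENT_RANGES)


def escape_as_code_point(ch: str) -> str:
    """Backslash, lowercase hex value, then a terminating space."""
    return "\\" + format(ord(ch), "x") + " "


def serialize_identifier(ident: str) -> str:
    out = []
    for i, ch in enumerate(ident):
        cp = ord(ch)
        if cp == 0:
            out.append("\ufffd")
        elif 0x01 <= cp <= 0x1F or cp == 0x7F:
            out.append(escape_as_code_point(ch))
        elif i == 0 and "0" <= ch <= "9":
            out.append(escape_as_code_point(ch))
        elif i == 1 and "0" <= ch <= "9" and ident[0] == "-":
            out.append(escape_as_code_point(ch))
        elif i == 0 and ch == "-" and len(ident) == 1:
            out.append("\\" + ch)
        elif is_non_ascii_ident(cp) or ch in "-_" or (cp < 0x80 and ch.isalnum()):
            out.append(ch)  # the clause changed by the proposal
        else:
            out.append("\\" + ch)  # escaped character
    return "".join(out)


print(serialize_identifier("caf\u00e9"))   # café
print(serialize_identifier("a\u00b0b"))    # a\°b
```

With the old ">= U+0080" wording the degree sign would be emitted unescaped, and the result would no longer parse back as a single identifier under the new tokenizer rules.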

tabatkins · 2022-06-17T22:28:57Z

Yes, it should be.

romainmenke · 2022-11-05T19:13:32Z

Is it correct that the ranges are inclusive?
Other definitions are explicit about this.

lowercase letter has this definition :

A code point between U+0061 LATIN SMALL LETTER A (a) and U+007A LATIN SMALL LETTER Z (z) inclusive.

Can we change the current text to :

non-ASCII ident code point
A code point whose value is any of:
U+00B7
between U+00C0 and U+00D6 inclusive
between U+00D8 and U+00F6 inclusive
between U+00F8 and U+037D inclusive
between U+037F and U+1FFF inclusive
U+200C
U+200D
U+203F
U+2040
between U+2070 and U+218F inclusive
between U+2C00 and U+2FEF inclusive
between U+3001 and U+D7FF inclusive
between U+F900 and U+FDCF inclusive
between U+FDF0 and U+FFFD inclusive
greater than or equal to U+10000

Update :

I forgot I asked this here and asked it again in #8861 (comment)

This has been answered and needed edits were made.

aphillips added css-variables-1 Current Work i18n-needs-resolution Issue the Internationalization Group has raised and looks for a response on. labels Mar 10, 2022

aphillips mentioned this issue Mar 10, 2022

custom property names too permissive, require namespacing rules w3c/i18n-activity#1487

Open

tabatkins added css-syntax-3 Agenda+ labels Mar 10, 2022

svgeesus removed the css-variables-1 Current Work label Mar 15, 2022

svgeesus changed the title ~~[css-variables] custom property names too permissive, require namespacing rules~~ [css-syntax] custom property names too permissive, require namespacing rules Mar 15, 2022

svgeesus mentioned this issue Mar 15, 2022

[css-variables-1] Horizontal Review #6808

Closed

5 tasks

css-meeting-bot removed the Agenda+ label Mar 16, 2022

mozilla-apprentice mentioned this issue Mar 16, 2022

[css-syntax] custom property names too permissive, require namespacing rules mozilla/wg-decisions#755

Closed

tabatkins added the Needs Edits label Mar 16, 2022

tabatkins added a commit that referenced this issue Mar 23, 2022

[css-syntax-3] Per WG resolution, restrict non-ASCII ident code point…

f1792dd

…s to the same list that HTML allows in custom element names. #7129

tabatkins added Needs Testcase (WPT) and removed Needs Edits labels Mar 23, 2022

tabatkins added the cssom-1 Current Work label Jun 17, 2022

fantasai added the Needs Edits label Jan 17, 2023

tabatkins removed the Needs Edits label May 22, 2023

tabatkins mentioned this issue May 22, 2023

Test the definition of non-ASCII codepoint. web-platform-tests/wpt#40147

Merged

estelle mentioned this issue Aug 22, 2023

semi-colon, not colon mdn/content#28686

Merged
