Vague meaning author term #203

Closed
mrjj opened this issue Mar 7, 2020 · 13 comments

@mrjj
mrjj commented Mar 7, 2020

The Web Publication Manifest has a rather ambiguous author field.

The term author usually refers to someone who originated a written creative work.
Creator is the wider term. When a concept and its subset are both defined at the same level of a bibliographic description, it is unclear which field to use (or whether to fill both with the same record).
Lewis Carroll is both the author and the creator of "Alice's Adventures in Wonderland", and the only example in the WPM body suggests that the author field should be used, with no clue about the criteria of use or the explicit difference between 'author' and 'creator'.

The short solution I see is to use better definitions of the author and creator fields, e.g. from the LC relator codes vocabulary (text and links below).

The best solution I see is to remove the author field. Even with clear criteria for the difference there will be a second-level problem: how to interpret these fields when translating to other standards and forms (which field should be primary if there is display space to show only one, and so on).

Below I provide related fragments from neighbouring standards, some explanation of what lies behind them, and why it would be better to remove the author term rather than creator.

Definitions from W3C Web Publication Manifest

  • author - The author of the publication. | One or more Person and/or Organization. | Array of Entities | author (CreativeWork)
  • creator - The creator of the publication. | One or more Person and/or Organization. | Array of Entities | creator (CreativeWork)

Definitions from DCMI

The DCMI standard does not introduce an Author term.

FOAF

The FOAF note on the relationship between the dct:creator/dct:Agent terms and maker is worth mentioning:

The Dublin Core specification provides term definitions that focus on issues of resource discovery, document description and related concepts useful for cultural heritage and digital library applications. FOAF can be used alongside any variants of Dublin Core, but works most effectively with the most modern Dublin Core terms namespace. Note that here we use the prefix 'dct:' to stand for the DC Terms namespace; however it is not unusual to see 'dc' also used.

  • dct:Agent - Dublin Core's notion of Agent is much like FOAF's; Dublin Core says "A resource that acts or has the power to act.", we say "things that do stuff". As nobody has provided a counter-example of something fitting one definition but not the other, we say here that foaf:Agent stands in an 'equivalent class' relationship to dct:Agent (and vice-versa).
  • dct:creator - The notion of 'creator' in the latest versions of Dublin Core matches FOAF's notion of 'maker'; based on their definitions, every pair of things that are related by one of those properties are also related by the other. We express this by saying that these properties stand in an 'equivalent property' relationship' to one another.

Definitions from MARC21 relator terms vocabulary and LC LD relator terms vocabulary

  • Author aut - A person, family, or organization responsible for creating a work that is primarily textual in content, regardless of media type (e.g., printed text, spoken word, electronic text, tactile text) or genre (e.g., poems, novels, screenplays, blogs). Use also for persons, etc., creating a new work by paraphrasing, rewriting, or adapting works by another creator such that the modification has substantially changed the nature and content of the original or changed the medium of expression
  • Creator cre: A person or organization responsible for the intellectual or artistic content of a resource.

In fact, I don't see the aut/cre codes being common in bibliographic catalogues; usually they are omitted and only the name/authority code is defined. Bibliographers specify the kind of relationship only for relations more specific than being the originator. So even in the official LC example of the MARC21 100 main name entry field you may see no authors or creators defined explicitly.

I'm pretty sure this is not an accidental detail: a bibliographer always carries part of the authority institution's responsibility and guarantees that there is a single description, as complete and correct as possible, defined within the scope of specific functional requirements and a description standard. Any collision between two current descriptions usually means that one of the versions should be considered outdated.

This simple authority control rule is the main pillar that makes de-duplication of metadata possible. It also prevents a lot of holy wars about which record is right: the question becomes which of them is outdated and which should be corrected. The sane answer is to delete the record with the more recent control code (less background and sync history), or perhaps to correct it to match the other record.

All of this usually keeps cataloguers away from ambiguous definitions that could be catalogued differently, even when such definitions exist in the description standard's vocabulary.

BibFrame2

As the MARC family of standards drifted toward Linked Data (LD), it evolved into the BibFrame2 model (conceptually a FOAF-like agent-activity-entity model).

BibFrame has a complex spine of work-concept levels with three levels; the previous version had two levels with different names. It is hard to say that BF2 is clear or stable, but it is "good enough to use". Because it is not yet mature, there are many approaches to practical use and entity linking. The core idea about contributors is something like the following:

      (`Work`)        ->         (`Instance`)         ->       (`Item`)
         ^                             ^                           ^
         ^                   (`ProvisionActivity`)                 ^
         ^                             ^                           ^
  (`Contribution`)              (`Contribution`)            (`Contribution`)
      ^     ^                ^    ^          ^    ^             ^      ^
 (`Agent`) (`Role`) | (`Agent`)(`Role`) (`Agent`)(`Role`)  (`Agent`)(`Role`) 
                    |
 WPM author/creator |                      WPM contributor

The Agent plays a Role (matching the relator terms vocabulary) and thereby makes the Contribution.

The Web Publication Manifest author/creator fields take the form of Agents making a Contribution, as do the Web Publication Manifest contributors.

  • Contribution - Agent and its Role in relation to the resource. Used with Work, Instance or Item
  • Role - Function played or provided by a contributor, e.g., author, illustrator, etc.
  • Agent - Entity having a role in a resource, such as a person or organization.
  • ProvisionActivity - Information about the agent or place relating to the publication, printing, distribution, issue, release, or production of a resource.
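
To make the Agent/Role/Contribution idea concrete, here is a minimal Python sketch of how the role-named manifest properties could be flattened into contribution records. It is an illustration only, not BibFrame tooling: the class names mirror the terms above, the `PROPERTY_TO_ROLE` mapping and helper functions are hypothetical, and entities are assumed to be either plain strings or objects with a string `name`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Agent:
    """Entity having a role in a resource (person or organization)."""
    name: str


@dataclass(frozen=True)
class Contribution:
    """An Agent together with the Role it plays for a resource."""
    agent: Agent
    role: str  # relator code, e.g. "aut" or "cre"


# Hypothetical mapping from manifest property names to MARC relator codes.
PROPERTY_TO_ROLE = {
    "author": "aut",
    "creator": "cre",
    "illustrator": "ill",
    "translator": "trl",
}


def as_list(value):
    """Normalize a missing, single, or list value to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def entity_name(entity):
    """Assumption: an entity is a plain string or a dict with a string "name"."""
    return entity["name"] if isinstance(entity, dict) else entity


def manifest_to_contributions(manifest):
    """Flatten the role-named manifest properties into Contribution records."""
    return [
        Contribution(Agent(entity_name(entity)), role)
        for prop, role in PROPERTY_TO_ROLE.items()
        for entity in as_list(manifest.get(prop))
    ]


manifest = {
    "name": "Alice's Adventures in Wonderland",
    "author": "Lewis Carroll",
    "illustrator": [{"type": "Person", "name": "John Tenniel"}],
}
contributions = manifest_to_contributions(manifest)
```

In this shape the author/creator question becomes a question about which role code to attach to an agent, which is the part the flat manifest leaves open.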

I do not see BF2 as a perfect extension for WPM, because WPM seems to be flat by design from the start. In contrast, BF2 is designed with a large temporal dimension and can describe the historical process that produced a cultural object one can interact with directly. A web of BF2 records is definitely hostile to any tabular representation and in practice (e.g. by excluding shelf numbers and other storage identifier spaces) does not tend to describe the final layer of items or digital objects. Even the maintainers' official converters from other standards simply ignore information such as storage marks of physical items or processing details of digital publication containers.

In short: BF2 is not designed to provide records that are both isolated and understandable, and it hardly targets publishers.

AACR/ISFB/FRBR (it's a deep domain, so I'm not providing detailed links)

  • The "author" term is rarely used, and when it is, it usually means the "Work creator".

These families of standards are the real foundation for all the standards mentioned above, and DCMI may be considered a robust, consolidated shortcut to the regulation and experience of the bibliographic domain.

@mrjj changed the title from "Vague meaning and level of author term" to "Vague meaning author term" on Mar 7, 2020
@iherman
Member

iherman commented Mar 7, 2020

@mrjj, this is a similar issue to #202, with a similar answer. The choice of the WG was to reuse, as much as possible, the schema.org definitions. This is what we did. In essence, the Working Group did not want to engage in defining yet another vocabulary for such terms, and we prefer to let the community do that where such terms are already defined. Schema.org being one of the most widely used vocabularies, also at the core of search engines, this was a pragmatic choice.

You are right that the various vocabularies you refer to provide more detailed definitions of authorship, and I can also see that some applications may want to use those instead. This is possible: the manifest is not a closed entity and it is perfectly possible to use those terms instead. See

https://www.w3.org/TR/pub-manifest/#extensibility-manifest-properties

for more details.

@mattgarrish
Member

I think there's an unintended issue here that's the opposite of what is being asked. By the schema.org definition, creator is synonymous with author:

creator | The creator/author of this CreativeWork. This is the same as the Author property for CreativeWork.

That part of the definition didn't make it into the specification but seems like an important piece to make people aware of.

@mrjj
Author

mrjj commented Mar 7, 2020

Group did not want to engage in defining yet another vocabulary for such terms,

Yes, and exactly for this reason I propose using an existing, widely adopted domain vocabulary to clarify these terms. I always prefer to use a standard, whatever it is, without violating it, and to coordinate any problems found with the WG, first of all by providing practical use cases.

I understand that your primary goal is alignment between the variety of user agents and the Schema.org model while keeping the status quo, and this seems sane.

You say that compatibility with the knowledge graphs behind search engines (SE) is important. Yes, no doubt about it.

So, let's suppose I'm an author and I'm interested in raising the SE rank of my publication; this is a common use case. In which field should I place the author's name for this? I suppose you have no idea, I have no idea, and the SE maintainers have no idea, nor do the creators of the initial data model.

From what I've seen, large SE development teams mostly consist of young, talented, STEM-oriented people with a huge stack of burning business features and very little idea of how domains outside IT work. They simply don't have time to discuss the details of the entity models arriving through more and more data pipes.

Supporting a new data exchange standard is a very common type of burning feature. After the standard has been read (with a view to reducing implementation cost and time), everything that was unclear gets resolved ad hoc.

Then all the other agents of data exchange get poisoned by these ad hoc decisions, and after some time both the model and the logic receive a formal internal specification, because it is important for an SE to explain to other players how its ranking logic works. All of this forms the core of interchange between the majors. Once their internal interests reach parity, the majors open data exchange with mid-size and small players, using the same model but documented for third parties. Early adopters are often STEM people of the same kind, but running a domain-specific business or integrating a major's solution. They have even more burning business features and see the sane strategy not in gathering their own data banks but in using ready-made ones from the majors.

The best industry specialists will then be brought in to govern and supervise the exchange formats, align them with the interests of market players and the wider public, and prevent the variety of exchange formats from growing.

Then this negative feedback loop will mechanically repeat, or a new kind of data market offering will start a new one.

For my part, I am proposing what are, IMO, very delicate ways not to feed this process, on the assumption that you can give users of the standard some comments on the logic behind it and some rules of thumb to save adopters time.

These properties can be used in a manifest as this document defines only the minimal set of manifest items (see § 4.7.3.2 Additional Manifest Properties).

I supposed that using only one of the two synonymous (actually not synonymous) fields as a core field is a step toward this goal.
The reason I collected and organised the list of domain examples above is that developers ask me the simple question "In which field should I place names?" and I don't have a quick answer. It's a key field, and this problem may be common.

I think there's an unintended issue here that's the opposite of what is being asked. By the schema.org definition, creator is synonymous with author:

creator | The creator/author of this CreativeWork. This is the same as the Author property for CreativeWork.

That part of the definition didn't make it into the specification but seems like an important piece to make people aware of.

I think explanations alone may help without affecting any technical compatibility. Maybe it's an option to introduce an explicit display priority for the core fields. It would help make trade-offs related to limited screen space predictable, and the same goes for sort logic in publication lists.

I expect more hidden pitfalls with these quasi-synonyms: all the ancient ones that bibliographers had already met before they dropped the author term, plus some new ones from the age of digital transformation.
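
As an illustration of the display-priority idea in this comment, here is a minimal Python sketch of how a client with room for a single credit line might pick a field and derive a sort key. The priority order, the function names, and the assumption that entities are plain strings or objects with a string `name` are all hypothetical; nothing here is defined by the specification.

```python
# Hypothetical priority order for the single credit line a client can show.
DISPLAY_PRIORITY = ("author", "creator", "contributor")


def as_list(value):
    """Normalize a missing, single, or list value to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def entity_name(entity):
    """Assumption: an entity is a plain string or a dict with a string "name"."""
    return entity["name"] if isinstance(entity, dict) else entity


def primary_credit(manifest, priority=DISPLAY_PRIORITY):
    """Return (property, names) for the first non-empty property in priority order."""
    for prop in priority:
        names = [entity_name(e) for e in as_list(manifest.get(prop))]
        if names:
            return prop, names
    return None, []


def sort_key(manifest):
    """Sort publications by first credited name, then by title ("name")."""
    _, names = primary_credit(manifest)
    first = names[0].casefold() if names else ""
    return (first, manifest.get("name", ""))


library = [
    {"name": "Book B", "creator": ["Zed"]},
    {"name": "Book A", "author": "Alice", "creator": "Bob"},
]
ordered = sorted(library, key=sort_key)
```

With an explicit order like this, two clients given the same manifest would at least show and sort by the same name, whichever field the publisher filled in.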

@mattgarrish
Member

I think explanations alone may help without affecting any technical compatibility.

Perhaps not, but the practicalities of schema.org can't be changed here. It's a vocabulary that does arguably suffer a bit from a lack of central control over the design of the types, but that messiness is also the reality of expressing web metadata.

In neither specification was it our intent to press for specific or restrictive metadata practices, as we wanted to adapt to what people already do on the web. Taking out author isn't solving the problem, for example, as you cannot stop someone from using a term that is a part of schema.org (or this group has shown no interest in excluding valid terms). If you really want the term removed, you need to take your arguments to the schema.org folks.

And, unfortunately, my experience is that all the explanations in the world aren't going to make people author metadata "correctly", either. Creator is often confused with contributor, for example. It's not uncommon to find the names of people who worked on the digital format listed as creators in EPUB. We can quickly get bogged down in all kinds of minutiae if we try to design bibliographic records, and the reality is that many people will still do whatever they intuitively think is right.

It's also a common view in these parts that if proper bibliographic records are needed, they are created and stored separately by publishers. The metadata within a publication is more specifically geared toward the user agent with an emphasis on simplicity and only defining what can be justified as usable by a user agent. We're not out to displace ONIX, MARC, etc.

Maybe it's an option to introduce an explicit display priority for the core fields.

I've had this concern in the past, but you also have to understand that the publication manifest specification is not web publications. It is a common base from which specific implementations, like Audiobooks, are intended to be designed (that specification recommends the use of author, for example, and is silent on creators). I'm not sure if we can prioritize display at this level, as @iherman has already noted, implementations might use different metadata or could prioritize display differently.

@mrjj
Author

mrjj commented Mar 9, 2020

@mattgarrish thank you for such a deep level of understanding.

And, unfortunately, my experience is that all the explanations in the world aren't going to make people author metadata "correctly", either. Creator is often confused with contributor, for example. It's not uncommon to find the names of people who worked on the digital format listed as creators in EPUB.

I have to sit on both chairs too. Data vendors may have fine-grained ontology frameworks and concepts (not true), which are OK for internal exchange processes but not so useful when you are trying to help people gain access to a publication: at the level of everyday language, people may assume a completely different meaning.
I really don't want to play the ontology game myself, and I have spent a lot of time designing a catalogue (MDM) system that provides no data model of its own at all, just a way to keep records, bibliographic standards, crosswalks and expert collaboration in a single place, whatever the industry uses.

However, this is different from the publishing case, where we have companies developing software clients that don't want to go into the domain; they want a clear description of the format, forms, display view and so on. The same goes for publishers, considering that the object of publishing is a federal-level legal deposit and the clients target 80%-90% of the local active mobile device base. There are a lot of sides. You don't want to play the definition game, and neither do I, but Readium did, and did it well. Now their activity is being merged with the W3C's, and the web-pub specs are moving further and further from what Readium maintained with concern-level standards. Technically it's not a big deal to rename fields, but it's hard to align all the parties involved in publication distribution. If the Web Publication Manifest standard leaves everything to the domain level, then what is the final purpose of the standard compared to a thin wrapper around JSON-LD/RDF?

In just two quarters, about three flagship standards for publishing on web and mobile platforms have appeared. Currently we are trying to support all of them at the authoring level. It will be much harder to do so at the level of mature software backed by long-running vendor guarantees; e.g. Adobe InDesign currently supports barely a small subset of EPUB 3.0, and this is quite an expected case.

It is also not easy to align all this with integrity solutions based on JSON Schema 2019-09, which integrates an entity schema with its semantic linkage constraints. Being just an early adopter is not as hard as having a history of publishing and library domain standards behind you. Those standards have ready answers and a software ecosystem that works, one way or another. But they have proven unfriendly to the end user and to this level of requests and understanding. It is very hard to come up with something that rejects core industry experience, and it is no surprise that any major adopter will involve a lot of domain-level experts.

And when you arrive with core definitions that are messy from the domain's point of view, along with a long story that this is so because you have no idea about the domain, it makes for a very bad introduction.

I really hope that describing the major problems of being an adopter of the dpub/webpub standards makes some sense and that these are not merely local issues. Otherwise you have to extend the non-authoring/client side of LD integration platforms, following Europeana, and that takes a big bunch of resources in a completely different direction. For now we will just try to keep track of changes and align with everything that changes, and we will try to reach proposal level through OCLC, as one of the domain coordinators and a major provider of bibliographic metadata through WorldCat; but it's hard for me to see the W3C as only the technical side and coordination site for all this.

This is outside the current topic, but I've provided some highlights of our experience with audiobooks in w3c/wpub#465. In our quite massive use case, audiobook metadata turns out to be an a11y-extension-level thing, because the audiobooks are mostly not stand-alone creative works but the product of automatic rendering through a very advanced voice-tech stack from a local major (also a schema.org core maintainer). In practice it turns out to be not far from visual rendering directives, and it does not affect the metadata describing the structure and content of the core creative work.

Anyway, thank you a lot for your time and understanding!

@llemeurfr
Contributor

A Pub Manifest can contain any schema.org property not defined in the spec (section 1.4).
This specification therefore selects some properties already defined in schema.org, particularly relevant to publishing.

Selecting both author and creator, which have the same meaning in schema.org, is therefore a bad move; reading systems will display one or the other (some will treat them as equivalent but this is only luck) and creators of Pub Manifest will insert one or the other (sometimes both, in which case some reading systems will display duplicates).

In a standard, having two ways to do the same thing is always a recipe for disaster.

I'm therefore suggesting to remove creator from the RC, before it becomes a REC.

@mattgarrish
Member

Schema.org seems to give primacy to author over creator, as it only notes creator is a synonym for author, so works for me. But do we need to run this by the WG before making a change? It's not an obvious error of the specification.

@iherman
Member

iherman commented Apr 16, 2020 via email

@llemeurfr
Contributor

It's not an obvious error of the specification.

No it is not a bug, it's only a path to ambiguity during its deployment. Which is something to avoid.

@iherman
Member

iherman commented Apr 16, 2020

Right. So we may have to put a note into the document making it clear(er) when one should be used over the other (noting that schema.org does not make a difference between the two).

IMHO, this is actually a schema.org bug. Just making these two terms "identical" is some sort of a semantic bug.

@llemeurfr
Contributor

I don't follow you here Ivan. Why not simply delete it from this set of recommended properties in our spec? People can still use creator if they find a proper reason to do so, without the spec editor trying to find a good reason to use both when he doesn't know what this reason can be.

@mattgarrish
Member

mattgarrish commented Apr 16, 2020

After all, a creator and an author are different notions, aren't they?

Sure, "creator" is just an ambiguous designation for someone who played some significant role in the creation of the content. It's the default when you can't say anything more meaningful.

It's also a big reason why epub authoring metadata is complicated to process for reading systems, as we've had to find ways to inflect more meaning onto the dc:creator element.

But what's worrisome in this case is that the equivalence has already been set up by schema.org. It might not be wise to try and alter that. Otherwise, your metadata will come out meaning one thing for a search engine and possibly something else for a reading system, and that's not a good situation.

I've kind of changed my position on removing it, since removing it doesn't really address the problem. The property is valid whether listed or not, so it just leaves the confusion unaddressed if we drop it.

Like schema.org, we should probably clearly note that they are synonyms and even go so far as to state that during processing creators will be translated/appended to author (i.e., author is preferred and when present gets highest priority). That way there's no ambiguity after processing, at least.
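
A rough Python sketch of what "creators will be translated/appended to author" could look like during processing follows. This is not the processing algorithm from the specification: the function names are hypothetical, entities are assumed to be plain strings or objects with a string `name`, and de-duplicating by case-insensitive name is an added assumption, included to avoid the duplicate display mentioned earlier in the thread.

```python
def as_list(value):
    """Normalize a missing, single, or list value to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def entity_name(entity):
    """Assumption: an entity is a plain string or a dict with a string "name"."""
    return entity["name"] if isinstance(entity, dict) else entity


def merge_creators_into_authors(manifest):
    """Return a copy where creator entries are appended to author.

    Authors come first, so author keeps the highest priority. Entries whose
    name was already seen (case-insensitively) are dropped; that rule is an
    assumption of this sketch, not something the thread specifies.
    """
    result = dict(manifest)
    merged, seen = [], set()
    for entity in as_list(manifest.get("author")) + as_list(manifest.get("creator")):
        key = entity_name(entity).casefold()
        if key not in seen:
            seen.add(key)
            merged.append(entity)
    result.pop("creator", None)
    if merged:
        result["author"] = merged
    return result


raw = {
    "name": "Alice's Adventures in Wonderland",
    "author": "Lewis Carroll",
    "creator": [{"type": "Person", "name": "Lewis Carroll"}, "John Tenniel"],
}
processed = merge_creators_into_authors(raw)
```

After a step like this a reading system only has to look at author, so a manifest that lists the same person in both fields no longer produces two credits.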

@mattgarrish
Member

Closing this issue as we added a note earlier about the terms being synonyms with preference being given to author. That's all I can see we can do unless or until schema.org changes their approach.


4 participants