Spec Discussion: Small Website & Docs Changes
Hey guys! Let's dive into some suggestions for tweaking the specification based on the discussion in issue #173. This is all about making things clearer and more practical, so your input is super valuable.
Is there an existing issue for this?
Yep, I've already searched through the existing issues to make sure we're not doubling up on anything. ☑️
Description
This is a follow-up to #173, where we're looking at a list of "small" changes that I think would be beneficial to the requirements. Most of these are clarifications, and some are items that I think we can safely remove as requirements based on our previous discussions (hence, unchecking the boxes in the table for them).
For context, I'll be referencing the Hugging Face docs on dataset cards (which I'll call "HF docs" from here on out) quite a bit.
Also, keep an eye out for other sub-issues, as I'll be breaking down these recommendations further.
Affected Areas
These changes will primarily affect:
- Documentation files (docs/) ☑️
- Website content (website/) ☑️
Proposed Changes
Let's break down the specific changes I'm proposing:
1. language-details Field
This is where things get interesting. I've got two main suggestions here:
a. Standardizing Language Tags
The HF docs suggest using language tags from https://en.wikipedia.org/wiki/IETF_language_tag, like en-US. That's a solid approach, and we should totally recommend it. However, when I was putting together the static catalog, I noticed something: this field might not be as crucial as we thought. Almost every HF dataset we cataloged had en-US as the sole value for this field! This indicates a potential redundancy or a need to rethink how we use this field effectively.
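If we do recommend IETF tags, a lightweight shape check could help catch obviously malformed values. Here's a minimal sketch, not a full validator: it only accepts the common language[-Script][-REGION] subset of IETF tags (so no variants, extensions, or private-use subtags), and the function name looks_like_language_tag is purely hypothetical.

```python
import re

# Simplified subset of IETF language tags: language[-Script][-REGION].
# Assumption: this is NOT a complete validator; variants, extensions,
# private-use, and grandfathered tags are deliberately left out.
_TAG_PATTERN = re.compile(
    r"[A-Za-z]{2,3}"                   # primary language subtag, e.g. "en"
    r"(?:-[A-Za-z]{4})?"               # optional script subtag, e.g. "Hans"
    r"(?:-(?:[A-Za-z]{2}|[0-9]{3}))?"  # optional region, e.g. "US" or "419"
)


def looks_like_language_tag(value: str) -> bool:
    """Return True if value has the simplified language[-Script][-REGION] shape."""
    # fullmatch requires the whole string to fit the pattern.
    return _TAG_PATTERN.fullmatch(value) is not None


if __name__ == "__main__":
    for candidate in ["en-US", "zh-Hans-CN", "es-419", "english", "en_US"]:
        print(candidate, looks_like_language_tag(candidate))
```

A check like this only says a value is shaped like a tag; it doesn't confirm that the subtags are actually registered.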
The language-details field is meant to give specific language information about a dataset, but it has shown limited practical utility in our cataloging efforts. With en-US as the sole value on almost every Hugging Face dataset we cataloged, the field as currently used doesn't distinguish datasets by language and adds little to discoverability or usability. To make it more useful, we could explore more granular language tagging, include dialectal variations, or describe the linguistic composition of the dataset's content. Revisiting the requirements and guidelines for this field would also help align it with what dataset users and contributors actually need.
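To put a number on how uniform the field is, something like the tally below could be run over the catalog. This is only a sketch under assumptions: I'm pretending each cataloged card is a plain dict whose language-details value is either a string or a list of strings, and the name tally_language_details and the sample records are made up for illustration, not taken from the real catalog.

```python
from collections import Counter


def tally_language_details(cards):
    """Count how often each language-details value appears across cards.

    Assumption: each card is a dict, and "language-details" holds either
    a single string or a list of strings. Cards without the field are
    counted under the placeholder key "<missing>".
    """
    counts = Counter()
    for card in cards:
        value = card.get("language-details")
        if value is None:
            counts["<missing>"] += 1
        elif isinstance(value, str):
            counts[value] += 1
        else:
            # Treat any other value as an iterable of tags.
            counts.update(value)
    return counts


if __name__ == "__main__":
    # Hypothetical records, for illustration only.
    sample_cards = [
        {"name": "dataset-a", "language-details": "en-US"},
        {"name": "dataset-b", "language-details": ["en-US"]},
        {"name": "dataset-c"},
    ]
    for tag, count in tally_language_details(sample_cards).most_common():
        print(tag, count)
```

If one value dominates the tally, that's a decent signal the field isn't differentiating datasets as it stands.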