IANA Language Subtag Registry in SKOS

From W3C Wiki

This page contains a proposal to encode language codes of the IANA Language Subtag Registry in RDF. It is an implementation of Languages as RDF Resources.

Converting with awk and Perl

#!/usr/bin/perl

=head1 NAME

registry2skos.pl - Convert IANA Language Subtag Registry to SKOS

=head1 SYNOPSIS

1. download the registry
  wget http://www.iana.org/assignments/language-subtag-registry

2. clean lines
  awk 'NR<3{next} /^[^ ]/{print L; L=$0} /^ /{L=L$0} END {print L}' language-subtag-registry > registry

3. convert by running this script

=head1 AUTHOR

Jakob Voss 

=head1 VERSION

0.1a - first draft and proof of concept

=cut

# General settings
my $VERSION = "0.1";

# Print header
print <<END;
<?xml version='1.0' encoding='ISO-8859-1'?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:skos="http://www.w3.org/2004/02/skos/core#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
END

# Parse registry
my %fields = ();

open(REG, "<", "registry") or die("no registry found: $!");
while (<REG>) {
  chomp;
  if ($_ =~ /^([A-Za-z-]+): (.*)$/) {
    if (defined $fields{$1}) { # multiple values
       if (ref $fields{$1} eq "ARRAY") {
           push @{ $fields{$1} }, $2;
       } else {
          $fields{$1} = [$fields{$1}, $2];
       }
    } else { # single value
      $fields{$1} = $2;
    }
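  } else { # record separator (%%) or empty line
    transform_code() if %fields;
    %fields = ();
  }
}
transform_code() if %fields; # the last record has no trailing separator
close REG;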

# Print footer
print "</rdf:RDF>\n";

# actual transforming
sub transform_code {
  my $xml = "";

  my $description = $fields{"Description"};
  if (ref($description) eq "ARRAY") {
    foreach my $d (@{$description}) {
      $xml .= "  <skos:altLabel>$d</skos:altLabel>\n";
    }
  } else {
      $xml .= "  <skos:altLabel>$description</skos:altLabel>\n";
  }
  if ($fields{"Comment"}) {
    $xml .= "  <skos:publicNote>" . $fields{"Comment"} . "</skos:publicNote>\n";
  }
  if ($fields{"Added"}) {
    $xml .= "  <dc:date>" . $fields{"Added"} . "</dc:date>\n";
  }

  if ($fields{Type} eq "language") {
    if ($fields{"Preferred-Value"} or $fields{"Deprecated"} ) {
      # TODO: deprecated codes not implemented yet
    } else {
      my $subtag = $fields{"Subtag"};
      $xml = "<skos:Concept rdf:about='#$subtag'>\n" .
             "  <skos:prefLabel>$subtag</skos:prefLabel>\n" . # TODO: skos:notation
             $xml;

      if ($fields{"Suppress-Script"}) {
        # TODO: add an additional code and create an skos:exactMatch
      }
      $xml .= "</skos:Concept>\n";
      print $xml;
    }
  } elsif ($fields{Type} eq "redundant") {
    my $tag = $fields{"Tag"};
    $xml = "<skos:Concept rdf:about='#$tag'>\n" .
           "  <skos:prefLabel>$tag</skos:prefLabel>\n" . # TODO: skos:notation
           $xml;
    my $lang = $tag; 
    $lang =~ s/-.*$//;
    $xml .= "  <skos:broader rdf:resource='#$lang'/>\n"; # TODO: link to script if region specified
    $xml .= "</skos:Concept>\n";
    print $xml;
  } else {
    # TODO: Type grandfathered,region,script,variant
    foreach my $k (keys %fields) {
      #print "$k: " . $fields{$k} . "\n";
    }
  }
}

__END__
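
The awk cleaning step and the parsing loop of the script could also be combined in Perl alone. The following is only a minimal sketch under assumptions, not part of the script above: the function name parse_registry and the sample text are hypothetical, and the sample is hand-written in registry format rather than copied from the registry. It folds continuation lines into the previous field, splits records on %% and collects repeated fields in an array reference, as the script does.

```perl
use strict;
use warnings;

# Parse registry text into a list of records (hash references).
# Hypothetical helper, not part of registry2skos.pl.
sub parse_registry {
    my ($text) = @_;
    my (@records, %fields);
    my $last;    # name of the field seen most recently

    # Save the current record (if any) and start a new one.
    my $store = sub {
        push @records, { %fields } if %fields;
        %fields = ();
        undef $last;
    };

    foreach my $line (split /\r?\n/, $text) {
        if ($line =~ /^%%\s*$/) {
            $store->();                       # record separator
        } elsif ($line =~ /^\s+(\S.*)$/ and defined $last) {
            # folded line: append to the latest value of the last field
            if (ref($fields{$last})) { $fields{$last}[-1] .= " $1" }
            else                     { $fields{$last}     .= " $1" }
        } elsif ($line =~ /^([A-Za-z-]+):\s*(.*)$/) {
            my ($name, $value) = ($1, $2);
            if (!exists $fields{$name}) {     # single value
                $fields{$name} = $value;
            } elsif (ref($fields{$name}) eq 'ARRAY') {
                push @{ $fields{$name} }, $value;
            } else {                          # second value: switch to a list
                $fields{$name} = [ $fields{$name}, $value ];
            }
            $last = $name;
        }
    }
    $store->();    # the last record has no trailing separator
    return @records;
}

# Illustrative sample only (assumption: hand-written, values not verified).
my $sample = <<'END';
File-Date: 2006-09-01
%%
Type: language
Subtag: es
Description: Spanish
Description: Castilian
%%
Type: redundant
Tag: de-CH-1996
Description: German, Swiss variant, orthography of
  1996
END

# Skip the File-Date record, which has no Type field.
foreach my $record (grep { defined $_->{Type} } parse_registry($sample)) {
    my $d      = $record->{Description};
    my @labels = ref($d) ? @$d : ($d);
    my $code   = defined $record->{Subtag} ? $record->{Subtag} : $record->{Tag};
    print "$code: ", join(" | ", @labels), "\n";
}
```

Returning a list of records instead of calling transform_code from inside the loop would keep parsing and SKOS generation separate, which may make the open tasks below easier to test.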


Open tasks

  • add concepts for grandfathered, region, script, variant
  • add skos:broader and skos:narrower between redundant codes and language/region/script/variant
  • add concepts for suppressed scripts and map them with skos:exactMatch or owl:sameAs
  • Finish SKOS Mapping so changes can be explicitly modeled
  • Clarify modelling of deprecated concepts in SKOS and add deprecated language codes
  • Define official URIs
  • Model other language codes (for instance MARC Language Codes) and create mapping
  • Testing
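
The task of linking redundant codes to their language, script, region and variant could start from the shape of each subtag. The following is only a rough sketch under assumptions: the function names classify_subtags and broader_links are hypothetical, the shapes follow the RFC 4646 syntax (script: 4 letters; region: 2 letters or 3 digits; variant: 5 to 8 alphanumerics, or a digit followed by 3 alphanumerics), and the '#...' fragment identifiers simply reuse the pattern of the script above, since official URIs are not defined yet.

```perl
use strict;
use warnings;

# Classify the subtags of a tag by their shape (RFC 4646 syntax).
# Extended language subtags, extensions and private use are ignored.
sub classify_subtags {
    my ($tag) = @_;
    my ($language, @rest) = split /-/, $tag;
    my %parts = (language => lc $language);
    foreach my $subtag (@rest) {
        if ($subtag =~ /^[A-Za-z]{4}$/
            and !$parts{script} and !$parts{region} and !$parts{variant}) {
            $parts{script} = ucfirst lc $subtag;      # e.g. Hans
        } elsif ($subtag =~ /^(?:[A-Za-z]{2}|[0-9]{3})$/
            and !$parts{region} and !$parts{variant}) {
            $parts{region} = uc $subtag;              # e.g. CN or 419
        } elsif ($subtag =~ /^(?:[A-Za-z0-9]{5,8}|[0-9][A-Za-z0-9]{3})$/) {
            push @{ $parts{variant} }, lc $subtag;    # e.g. 1996
        }
    }
    return %parts;
}

# One skos:broader element per component of the tag.
# Assumption: every component has a concept named by its bare subtag.
sub broader_links {
    my ($tag) = @_;
    my %parts = classify_subtags($tag);
    my @targets = ($parts{language});
    push @targets, $parts{script}       if $parts{script};
    push @targets, $parts{region}       if $parts{region};
    push @targets, @{ $parts{variant} } if $parts{variant};
    return map { "  <skos:broader rdf:resource='#$_'/>\n" } @targets;
}

print broader_links("zh-Hans-CN");
print broader_links("de-CH-1996");
```

Bare subtags as fragment identifiers are only distinguishable by letter case (language de, region DE), so a real URI scheme would probably need a prefix or separate namespace per subtag type.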

Resources

  • Language tags in HTML and XML (W3C i18n)
  • The IANA Language Subtag Registry
  • Phillips, A., Davis, M., "Tags for Identifying Languages", RFC 4646, September 2006.
  • Ewell, D., Ed., "Initial Language Subtag Registry", RFC 4645, September 2006.
  • Alvestrand, H., "Tags for the Identification of Languages", BCP 47, RFC 3066, January 2001.
  • Alvestrand, H., "Tags for the Identification of Languages", RFC 1766, March 1995.
  • International Organization for Standardization, "ISO 3166-1:1997. Codes for the representation of names of countries and their subdivisions -- Part 1: Country codes", 1997.
  • International Organization for Standardization, "ISO 639-1:2002. Codes for the representation of names of languages -- Part 1: Alpha-2 code", 2002.
  • International Organization for Standardization, "ISO 639-2:1998. Codes for the representation of names of languages -- Part 2: Alpha-3 code, first edition", 1998.
  • Library of Congress, ISO 639-2 Registration Authority
  • SIL International, ISO 639-3 Registration Authority
  • International Organization for Standardization, "ISO 15924:2004. Information and documentation -- Codes for the representation of names of scripts", January 2004.
  • Statistics Division, United Nations, "Standard Country or Area Codes for Statistical Use", UN Standard Country or Area Codes for Statistical Use, Revision 4 (United Nations publication, Sales No. 98.XVII.9), June 1999.
  • ANSI/NISO Z39.53, "Codes for the Representation of Languages for Information Interchange", 2001.
  • Library of Congress, "MARC Code list for languages", 2003.