IANA Language Subtag Registry in SKOS
This page contains a proposal to encode language codes of the IANA Language Subtag Registry in RDF. It is an implementation of Languages as RDF Resources.
Converting with awk and perl
#!/usr/bin/perl

=head1 NAME

registry2skos.pl - Convert IANA Language Subtag Registry to SKOS

=head1 SYNOPSIS

1. download the registry

  wget http://www.iana.org/assignments/language-subtag-registry

2. clean lines (drop the two header lines, join folded continuation lines)

  awk 'NR<3{next} /^[^ ]/{print L; L=$0} /^ /{L=L$0} END {print L}' language-subtag-registry > registry

3. convert by running this script

=head1 AUTHOR

Jakob Voss

=head1 VERSION

0.1a - first draft and proof of concept

=cut
# General settings
my $VERSION = "0.1";
# Print header
print <<END;
<?xml version='1.0' encoding='ISO-8859-1'?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
 xmlns:skos="http://www.w3.org/2004/02/skos/core#"
 xmlns:dc="http://purl.org/dc/elements/1.1/">
END
# Parse registry
my %fields = ();
open(REG, "<", "registry") or die("no registry found!");
while (<REG>) {
    chomp;
    if ($_ =~ /^([A-Za-z-]+): (.*)$/) {
        if (defined $fields{$1}) { # multiple values
            if (ref $fields{$1} eq "ARRAY") {
                push @{ $fields{$1} }, $2;
            } else {
                $fields{$1} = [$fields{$1}, $2];
            }
        } else { # single value
            $fields{$1} = $2;
        }
    } else { # record separator ("%%") or empty line
        transform_code();
        %fields = ();
    }
}
close REG;
transform_code() if %fields; # flush the last record, which has no trailing separator
# Print footer
print "</rdf:RDF>\n";
# actual transforming
sub transform_code {
    my $xml = "";
    my $description = $fields{"Description"};
    if (ref($description) eq "ARRAY") {
        foreach my $d (@{$description}) {
            $xml .= " <skos:altLabel>$d</skos:altLabel>\n";
        }
    } else {
        $xml .= " <skos:altLabel>$description</skos:altLabel>\n";
    }
    if ($fields{"Comments"}) { # the registry names this field "Comments"
        $xml .= " <skos:publicNote>" . $fields{"Comments"} . "</skos:publicNote>\n";
    }
    if ($fields{"Added"}) {
        $xml .= " <dc:date>" . $fields{"Added"} . "</dc:date>\n";
    }
    if ($fields{Type} eq "language") {
        if ($fields{"Preferred-Value"} or $fields{"Deprecated"}) {
            # TODO: deprecated codes not implemented yet
        } else {
            my $subtag = $fields{"Subtag"};
            $xml = "<skos:Concept rdf:about='#$subtag'>\n" .
                   " <skos:prefLabel>$subtag</skos:prefLabel>\n" . # TODO: skos:notation
                   $xml;
            if ($fields{"Suppress-Script"}) {
                # TODO: add an additional code and create an skos:exactMatch
            }
            $xml .= "</skos:Concept>\n";
            print $xml;
        }
    } elsif ($fields{Type} eq "redundant") {
        my $tag = $fields{"Tag"};
        $xml = "<skos:Concept rdf:about='#$tag'>\n" .
               " <skos:prefLabel>$tag</skos:prefLabel>\n" . # TODO: skos:notation
               $xml;
        my $lang = $tag;
        $lang =~ s/-.*$//; # keep only the primary language subtag
        $xml .= " <skos:broader rdf:resource='#$lang'/>\n"; # TODO: link to script if region specified
        $xml .= "</skos:Concept>\n";
        print $xml;
    } else {
        # TODO: Type grandfathered,region,script,variant
        foreach my $k (keys %fields) {
            #print "$k: " . $fields{$k} . "\n";
        }
    }
}
__END__
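The awk clean-up step from the synopsis could also be done in Perl, which would remove the dependency on a second tool. The following is only a sketch under a few assumptions: the sub name unfold_registry is made up for this example, the sample record's File-Date value is invented, and unlike the awk one-liner it collapses the leading spaces of a continuation line into a single space and does not emit an empty first line.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of registry2skos.pl): take the raw registry
# text and return its lines with folded continuation lines joined.
sub unfold_registry {
    my ($text) = @_;
    my @lines = split /\r?\n/, $text;
    splice @lines, 0, 2;    # drop "File-Date: ..." and the first "%%"
    my @unfolded;
    foreach my $line (@lines) {
        if ($line =~ /^ / && @unfolded) {
            # Continuation line: collapse leading spaces to one and append.
            (my $rest = $line) =~ s/^ +/ /;
            $unfolded[-1] .= $rest;
        } else {
            push @unfolded, $line;
        }
    }
    return @unfolded;
}

# Sample input; the File-Date value is made up for illustration.
my $sample = join "\n",
    "File-Date: 2006-01-01",
    "%%",
    "Type: language",
    "Subtag: ia",
    "Description: Interlingua (International Auxiliary Language",
    "  Association)",
    "%%",
    "Type: language",
    "Subtag: de",
    "Description: German";

print "$_\n" for unfold_registry($sample);
```

In this sample the two Description lines of the "ia" record come out as one line, while the "%%" separators are kept so that the parsing loop can still detect record boundaries.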
Open tasks
- add concepts for grandfathered, region, script, variant
- add skos:broader and skos:narrower between redundant codes and language/region/script/variant
- add concepts for suppressed scripts, mapped with skos:exactMatch or owl:sameAs
- Finish SKOS Mapping so changes can be explicitly modeled
- Clarify modelling of deprecated concepts in SKOS and add deprecated language codes
- Define official URIs
- Model other language codes (for instance MARC Language Codes) and create mappings
- Testing
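Linking redundant tags to language, region, script and variant concepts first requires splitting a tag into typed subtags. One possible starting point is sketched below. It guesses the type from the shape of each subtag alone (four letters for a script, two letters or three digits for a region, five to eight characters or a digit followed by three characters for a variant). The sub name classify_subtags is hypothetical, and the sketch deliberately ignores extended language subtags, extensions, private use and grandfathered tags.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of registry2skos.pl): return a list of
# [type, subtag] pairs for a tag, judging each subtag by its shape only.
sub classify_subtags {
    my ($tag) = @_;
    my ($language, @rest) = split /-/, $tag;
    my @parts = ( [ language => lc $language ] );
    foreach my $subtag (@rest) {
        if ($subtag =~ /^[A-Za-z]{4}$/) {
            push @parts, [ script => ucfirst lc $subtag ];
        } elsif ($subtag =~ /^(?:[A-Za-z]{2}|[0-9]{3})$/) {
            push @parts, [ region => uc $subtag ];
        } elsif ($subtag =~ /^(?:[A-Za-z0-9]{5,8}|[0-9][A-Za-z0-9]{3})$/) {
            push @parts, [ variant => lc $subtag ];
        } else {
            # Shape not covered by this simplified sketch.
            push @parts, [ unknown => $subtag ];
        }
    }
    return @parts;
}

# Print one "type: subtag" line per part of an example tag.
printf "%s: %s\n", @$_ for classify_subtags("de-AT-1901");
```

Each pair could then be turned into a skos:broader statement in the "redundant" branch of transform_code, once URIs for region, script and variant concepts have been defined.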
Resources
- Language tags in HTML and XML (W3C i18n)
- The IANA Language Subtag Registry
- Phillips, A., Davis, M., "Tags for Identifying Languages", RFC 4646, September 2006.
- Ewell, D., Ed., "Initial Language Subtag Registry", RFC 4645, September 2006.
- Alvestrand, H., "Tags for the Identification of Languages", BCP 47, RFC 3066, January 2001.
- Alvestrand, H., "Tags for the Identification of Languages", RFC 1766, March 1995.
- International Organization for Standardization, "ISO 3166-1:1997. Codes for the representation of names of countries and their subdivisions -- Part 1: Country codes", 1997.
- International Organization for Standardization, "ISO 639-1:2002. Codes for the representation of names of languages -- Part 1: Alpha-2 code", 2002.
- International Organization for Standardization, "ISO 639-2:1998. Codes for the representation of names of languages -- Part 2: Alpha-3 code, first edition", 1998.
- Library of Congress, ISO 639-2 Registration Authority
- SIL International, ISO 639-3 Registration Authority
- International Organization for Standardization, "ISO 15924:2004. Information and documentation -- Codes for the representation of names of scripts", January 2004.
- Statistics Division, United Nations, "Standard Country or Area Codes for Statistical Use", UN Standard Country or Area Codes for Statistical Use, Revision 4 (United Nations publication, Sales No. 98.XVII.9), June 1999.
- ANSI/NISO Z39.53, "Codes for the Representation of Languages for Information Interchange", 2001.
- Library of Congress, "MARC Code list for languages", 2003.