CRAN module

This module imports R packages published at the Comprehensive R Archive Network (CRAN).

Specifically, it reads the table of packages ordered by date of publication. For each R package, this table contains the package name, title and date of publication. Based on the package name, the page of each package can be accessed at:

https://cran.r-project.org/web/packages/<package_name>/index.html
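As a minimal sketch of how the URL pattern above is filled in, the helper below substitutes a package name into the template. The function name package_url is hypothetical and is not part of the importer.

```python
# URL template taken from the pattern shown above.
CRAN_PACKAGE_URL = "https://cran.r-project.org/web/packages/{name}/index.html"


def package_url(package_name: str) -> str:
    """Return the CRAN page for a package (hypothetical helper)."""
    return CRAN_PACKAGE_URL.format(name=package_name)


print(package_url("ggplot2"))
```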

Several attributes are listed for each package. Among them, the following attributes are imported, when present, to the MaRDI knowledge graph:

Version:

Version of the package.

Depends:

Software and package dependencies, including other R packages.

Published:

Date of publication.

Author:

According to the CRAN Repository Policy, the authors of a package are to be marked with the abbreviation [aut]. Because this guideline is not always followed, authors cannot always be parsed reliably.

When no abbreviations describing the role of each individual are included, only the first listed author is imported.

License:

List of accepted licenses.

Maintainer:

Software maintainer (generally one of the authors).

mardi_importer.cran.CRANSource class

class mardi_importer.cran.CRANSource.CRANSource[source]

Bases: ADataSource

Processes data from the Comprehensive R Archive Network.

Metadata for each R package is scraped from the CRAN Repository. The Wikibase item corresponding to each R package is then updated or, for a new package, created.

packages

Dataframe with package name, title and date of publication for each package in CRAN.

Type:

Pandas dataframe

setup()[source]

Creates all necessary properties and entities for CRAN.

create_local_entities()[source]
pull()[source]

Reads date, package name and title from the CRAN Repository URL.

The result is saved as a pandas dataframe in the attribute packages.

Returns:

Attribute packages

Return type:

Pandas dataframe

Raises:

ImporterException – If table at the CRAN url cannot be accessed or read.

push()[source]

Updates the MaRDI Wikibase entities corresponding to R packages.

For each package name in the attribute packages, checks whether the date in CRAN coincides with the date in the MaRDI knowledge graph. If not, the package is updated. If the package is not found in the MaRDI knowledge graph, the corresponding item is created.

It creates a mardi_importer.cran.RPackage instance for each package.
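The decision made by push() for each package can be sketched as follows. This is only an illustration under stated assumptions: the function decide_action and the two dictionaries are hypothetical stand-ins for the packages dataframe and the dates stored in the knowledge graph, not the importer's actual code.

```python
from typing import Dict, Optional


def decide_action(cran_date: str, graph_date: Optional[str]) -> str:
    """Return 'create', 'update' or 'skip' for one package (hypothetical helper)."""
    if graph_date is None:
        # The package is not in the knowledge graph yet.
        return "create"
    if graph_date != cran_date:
        # The dates do not coincide, so the item is out of date.
        return "update"
    return "skip"


# Made-up example data: publication dates from CRAN and dates already stored locally.
cran_packages: Dict[str, str] = {
    "pkgA": "2024-01-10",
    "pkgB": "2024-02-01",
    "pkgC": "2023-12-24",
}
graph_dates: Dict[str, str] = {"pkgA": "2024-01-10", "pkgB": "2023-11-30"}

actions = {
    name: decide_action(date, graph_dates.get(name))
    for name, date in cran_packages.items()
}
print(actions)
```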

mardi_importer.cran.RPackage class

class mardi_importer.cran.RPackage.RPackage(date: str, label: str, description: str, api: ~mardi_importer.integrator.MardiIntegrator.MardiIntegrator, long_description: str = '', url: str = '', version: str = '', versions: ~typing.List[~typing.Tuple[str, str]] = <factory>, authors: ~typing.List[~mardi_importer.publications.Author.Author] = <factory>, license_data: ~typing.List[~typing.Tuple[str, str]] = <factory>, dependencies: ~typing.List[~typing.Tuple[str, str]] = <factory>, imports: ~typing.List[~typing.Tuple[str, str]] = <factory>, maintainer: str = '', author_pool: ~typing.List[~mardi_importer.publications.Author.Author] = <factory>, crossref_publications: ~typing.List[~mardi_importer.publications.CrossrefPublication.CrossrefPublication] = <factory>, arxiv_publications: ~typing.List[~mardi_importer.publications.ArxivPublication.ArxivPublication] = <factory>, zenodo_resources: ~typing.List[~mardi_importer.publications.ZenodoResource.ZenodoResource] = <factory>, _QID: str = '', _item: ~mardi_importer.integrator.MardiEntities.MardiItemEntity = None)[source]

Bases: object

Class to manage R package items in the local Wikibase instance.

date

Date of publication

Type:

str

label

Package name

Type:

str

description

Title of the R package

Type:

str

long_description

Detailed description of the R package

Type:

str

url

URL to the CRAN repository

Type:

str

version

Version of the R package

Type:

str

versions

Previous published versions

Type:

List[Tuple[str, str]]

author

Author(s) of the package

license

Software license

dependency

Dependencies to R and other packages

imports

Imported R packages

Type:

List[Tuple[str, str]]

maintainer

Software maintainer

Type:

str

_QID

Package QID

Type:

str

integrator

API to MaRDI integrator

date: str
label: str
description: str
api: MardiIntegrator
long_description: str = ''
url: str = ''
version: str = ''
versions: List[Tuple[str, str]]
authors: List[Author]
license_data: List[Tuple[str, str]]
dependencies: List[Tuple[str, str]]
imports: List[Tuple[str, str]]
maintainer: str = ''
author_pool: List[Author]
crossref_publications: List[CrossrefPublication]
arxiv_publications: List[ArxivPublication]
zenodo_resources: List[ZenodoResource]
property QID: str

Return the QID of the R package in the knowledge graph.

Searches for an item with the package label in the Wikibase SQL tables and returns the QID if a matching result is found.

Returns:

The entity QID representing the R package.

Return type:

str

property item: MardiItemEntity

Return the integrator Item representing the R package.

Also adds the label and description of the package.

Returns:

Integrator item

Return type:

MardiItemEntity

exists() → str[source]

Checks if an item corresponding to the R package already exists.

Returns:

Entity ID

Return type:

str

is_updated() → bool[source]

Checks if the Item corresponding to the R package is up to date.

Compares the last update property in the local knowledge graph with the publication date imported from CRAN.

Returns:

True if both dates coincide, False otherwise.

Return type:

bool

pull()[source]

Imports metadata from CRAN corresponding to the R package.

Imports Version, Dependencies, Imports, Authors, Maintainer and License and saves them as instance attributes.

create() → None[source]

Create a package in the Wikibase instance.

This function pulls the package, inserts its claims, and writes it to the Wikibase instance.

Returns:

None

write() → Dict[str, str] | None[source]

Write the package item to the Wikibase instance.

If the item has claims, it will be written to the Wikibase instance. If the item is successfully written, a dictionary with the QID of the item will be returned.

Returns:

A dictionary with the QID of the written item if successful, or None otherwise.

Return type:

Optional[Dict[str, str]]

insert_claims()[source]
update()[source]

Updates existing WB item with the imported metadata from CRAN.

The metadata corresponding to the package is first pulled from CRAN and saved as instance attributes through pull(). Statements that do not coincide with the locally saved information are then updated or substituted with the new information.

Uses mardi_importer.wikibase.WBItem to update the item corresponding to the R package.

Returns:

ID of the updated R package.

Return type:

str

process_claims(data, prop_nr, qualifier_nr=None)[source]
parse_publications(description)[source]

Extracts the DOIs of related publications.

Identifies the DOI of publications that are mentioned using the format doi: or arXiv: in the long description of the R package.

Returns:

List containing the wikibase IDs of mentioned publications.

Return type:

List
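The identifier extraction described for parse_publications() can be approximated with two regular expressions. This is a sketch under assumptions, not the importer's implementation: the patterns and the function find_identifiers are hypothetical, the sample description is made up, and the lookup of Wikibase IDs for the matches is left out.

```python
import re
from typing import Dict, List

# Assumed patterns: a DOI starts with "10." and a registrant code; new-style
# arXiv identifiers look like 2001.01234, optionally followed by a version.
DOI_PATTERN = re.compile(r"doi:\s*(10\.\d{4,9}/[^\s<>\"]+)", re.IGNORECASE)
ARXIV_PATTERN = re.compile(r"arxiv:\s*(\d{4}\.\d{4,5}(?:v\d+)?)", re.IGNORECASE)


def find_identifiers(description: str) -> Dict[str, List[str]]:
    """Collect DOI and arXiv identifiers mentioned in a long description."""
    # Trailing punctuation is stripped in case the DOI ends a sentence.
    dois = [doi.rstrip(".,;)") for doi in DOI_PATTERN.findall(description)]
    arxiv_ids = ARXIV_PATTERN.findall(description)
    return {"doi": dois, "arxiv": arxiv_ids}


sample = (
    "Methods are described in Doe (2020) <doi:10.1000/example.123> "
    "and in a preprint <arXiv:2001.01234>."
)
print(find_identifiers(sample))
```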

get_last_update()[source]

Returns the package last update date saved in the Wikibase instance.

Returns:

Last update date in format DD-MM-YYYY.

Return type:

str

clean_package_list(table_html)[source]

Processes raw imported data from CRAN to enable the creation of items.

  • Package dependencies are split at each comma.

  • License information is processed using the parse_license() method.

  • Author information is processed using the parse_authors() method.

  • Maintainer information is processed using the parse_maintainer() method.

Parameters:

table_html – HTML code obtained with BeautifulSoup corresponding to the table containing the metadata of the R package imported from CRAN.

Returns:

Dataframe with processed data from a single R package including columns: Version, Author, License, Depends, Imports and Maintainer.

Return type:

(Pandas dataframe)

parse_software(software_str: str) → List[Tuple[str, str]][source]

Processes the dependency and import information of each R package.

This includes:

  • Extracting the version information of each dependency/import, if provided.

  • Providing the Item QID given the dependency/import label.

  • Creating a new Item if the dependency/import is not found in the local knowledge graph.

Returns:

List of tuples including software QID and version.

Return type:

List[Tuple[str, str]]
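The first step of parse_software(), separating each dependency from its optional version constraint, could look roughly like the sketch below. The function split_dependencies and its regular expression are assumptions for illustration. Unlike the real method, it returns package names rather than QIDs and never touches the knowledge graph.

```python
import re
from typing import List, Tuple

# A name made of letters, digits and dots, optionally followed by "(constraint)".
ENTRY_PATTERN = re.compile(r"^\s*([A-Za-z0-9.]+)\s*(?:\(([^)]*)\))?\s*$")


def split_dependencies(software_str: str) -> List[Tuple[str, str]]:
    """Split a Depends/Imports string into (name, version) tuples."""
    result = []
    for entry in software_str.split(","):
        match = ENTRY_PATTERN.match(entry)
        if match:
            name = match.group(1)
            version = (match.group(2) or "").strip()
            result.append((name, version))
    return result


print(split_dependencies("R (>= 3.5.0), methods, Rcpp (>= 1.0.7)"))
```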

parse_license(x: str) → List[Tuple[str, str]][source]

Splits string of licenses.

Takes into account that licenses are often not uniformly listed. Characters |, + and , are used to separate licenses. Further details on each license are often included in square brackets.

The concrete License is identified and linked to the corresponding item that has previously been imported from Wikidata. Further license information, when provided between round or square brackets, is added as a qualifier.

If a file license is mentioned, the link to the file license in CRAN is added as a qualifier.

Parameters:

x (str) – String imported from CRAN representing license information.

Returns:

List of license tuples. Each tuple contains the license QID as the first element and the license qualifier as the second element.

Return type:

List[Tuple[str, str]]
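The splitting rules described for parse_license() can be illustrated with the sketch below. It is a simplified stand-in: split_licenses is a hypothetical name, and it returns license names with their bracketed details instead of QIDs and qualifiers. Separators that appear inside brackets are ignored, so a detail such as "expanded from: GPL (>= 2)" stays in one piece.

```python
import re
from typing import List, Tuple

# License name, optionally followed by details in round or square brackets.
PART_PATTERN = re.compile(r"^\s*([^\[(]*?)\s*(?:[\[(](.*)[\])])?\s*$")


def split_licenses(x: str) -> List[Tuple[str, str]]:
    """Split a CRAN license string into (name, detail) tuples."""
    parts, current, depth = [], "", 0
    for char in x:
        if char in "([":
            depth += 1
        elif char in ")]":
            depth = max(depth - 1, 0)
        if char in "|+," and depth == 0:
            # A separator outside brackets ends the current license.
            parts.append(current)
            current = ""
        else:
            current += char
    parts.append(current)

    licenses = []
    for part in parts:
        match = PART_PATTERN.match(part)
        if match and match.group(1):
            licenses.append((match.group(1), match.group(2) or ""))
    return licenses


print(split_licenses("GPL-2 | GPL-3 [expanded from: GPL (>= 2)]"))
print(split_licenses("MIT + file LICENSE"))
```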

parse_authors(x)[source]

Splits the string corresponding to the authors into a dictionary.

Author information in CRAN is not registered uniformly. This function parses the imported string and returns just the names of the individuals that can be unequivocally identified as authors (i.e. they are followed by the [aut] abbreviation).

Generally, authors in CRAN are indicated with the abbreviation [aut]. When no abbreviations are included, only the first individual is imported to Wikibase (otherwise it can often not be established whether information after the first author refers to another individual, an institution, a funder, etc.)

Parameters:

x (String) – String imported from CRAN representing author information.

Returns:

Dictionary of authors and corresponding ORCID ID, if provided.

Return type:

(Dict)
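A rough sketch of the author-selection rule follows. It assumes entries of the form "Name [roles] (optional comment)"; the function extract_authors, the patterns and the sample names are hypothetical, and the ORCID shown is only a placeholder value. The real method may handle more input variants.

```python
import re
from typing import Dict

# "Name [roles]" optionally followed by a comment in round brackets.
ENTRY_PATTERN = re.compile(r"([^\[\](),]+?)\s*\[([^\]]*)\]\s*(?:\(([^)]*)\))?")
ORCID_PATTERN = re.compile(r"\d{4}-\d{4}-\d{4}-\d{3}[\dX]")


def extract_authors(x: str) -> Dict[str, str]:
    """Map author names to their ORCID iD ('' if none is given)."""
    entries = ENTRY_PATTERN.findall(x)
    if not entries:
        # No role abbreviations at all: keep only the first listed individual.
        first = x.split(",")[0].strip()
        return {first: ""} if first else {}

    authors = {}
    for name, roles, comment in entries:
        if "aut" in [role.strip() for role in roles.split(",")]:
            orcid = ORCID_PATTERN.search(comment)
            authors[name.strip()] = orcid.group(0) if orcid else ""
    return authors


sample = (
    "Jane Doe [aut, cre] (<https://orcid.org/0000-0002-1825-0097>), "
    "John Roe [aut], Example Corp [cph]"
)
print(extract_authors(sample))
print(extract_authors("Jane Doe, John Roe"))
```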

parse_maintainer(name: str) → str[source]

Remove unnecessary information from maintainer string.

Parameters:

name (str) – String imported from CRAN, which may contain an e-mail address and comments within brackets.

Returns:

Name of the maintainer

Return type:

(str)
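The clean-up performed on the maintainer string can be sketched as below. This assumes the e-mail address is enclosed in angle brackets and comments in round or square brackets; clean_maintainer is a hypothetical name and the sample inputs are made up.

```python
import re


def clean_maintainer(name: str) -> str:
    """Strip the e-mail address and bracketed comments from a maintainer string."""
    without_email = re.sub(r"<[^>]*>", "", name)
    without_comments = re.sub(r"\([^)]*\)|\[[^\]]*\]", "", without_email)
    # Collapse the whitespace left behind by the removed parts.
    return " ".join(without_comments.split())


print(clean_maintainer("Jane Doe <jane.doe at example.org>"))
print(clean_maintainer("Jane Doe (on behalf of the team) <jane.doe@example.org>"))
```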

get_license_QID(license_str: str) → str[source]

Returns the Wikidata item ID corresponding to a software license.

The same license is often referred to in CRAN under different names. This function returns the Wikidata item ID of the single license behind those names (e.g. Artistic-2.0 and Artistic License 2.0 both refer to the same license, corresponding to item Q14624826).

Parameters:

license_str (str) – String corresponding to a license imported from CRAN.

Returns:

Wikidata item ID.

Return type:

(str)
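One simple way to resolve several spellings to a single item is an alias table, sketched below. The table contains only the example given above; the names LICENSE_ALIASES and lookup_license_qid are hypothetical, and the importer may well resolve licenses differently.

```python
from typing import Optional

# Alias table keyed by lower-cased license strings. Only the example from the
# description above is included; a real table would cover many more licenses.
LICENSE_ALIASES = {
    "artistic-2.0": "Q14624826",
    "artistic license 2.0": "Q14624826",
}


def lookup_license_qid(license_str: str) -> Optional[str]:
    """Return the Wikidata item ID for a license string, or None if unknown."""
    return LICENSE_ALIASES.get(license_str.strip().lower())


print(lookup_license_qid("Artistic-2.0"))
print(lookup_license_qid("Artistic License 2.0"))
```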

get_wikidata_QID() → str | None[source]

Get the Wikidata QID for the R package.

Searches for the R package in Wikidata using its label. Retrieves the QID of matching entities and checks if there is an instance of an R package. If so, returns the QID.

Returns:

The Wikidata QID of the R package if found, or None otherwise.

Return type:

Optional[str]

get_versions()[source]