Link
Assuming that you have read the Authentication Quickstart, you can start linking your company data with S&P Global's Market Intelligence or CapitalIQ data.
Concepts
The Link API is a machine learning service that maps your entities to identifiers in S&P's knowledge bases. For example, given a company name in your database, you can get a unique company identifier from the S&P Capital IQ or S&P Capital IQ Pro database.
Entity Type: A real world concept like a company.
Entity: A single specific real world thing.
Example: "S&P" and "Kensho" are Entities both with an Entity Type of "Company"
Records:
Each Record is a collection of details currently known about an Entity. These details are used to find a corresponding Identifier in an S&P knowledge base.
The only required fields in a Record are uid and name. All fields other than uid are features used by the model in predicting an Entity Link, with each feature's weight determined by the model:
- uid: A unique identifier for each input record. Used in the output to show which linked entity corresponds to which input record.
- name: Text representing the name of the entity.
- aliases: Other names the entity might go by.
- address or (address_1, address_2): One or more addresses associated with the entity.
- city: A city associated with the entity.
- state or state_abbreviation: A state associated with the entity.
- zipcode: A zipcode associated with the entity.
- country or country_iso3 or country_iso2: A country associated with the entity.
- phone_number: A phone number associated with the entity.
- url: A URL associated with the entity.
- year_founded: The year the entity was founded.
Entity Link: Returned by the Link API. Contains the Identifier from the S&P knowledge base along with a Link Score. Optionally contains any additionally requested knowledge base information about the entity.
Score: The Link Score for a given Entity Link corresponds to the quality of the match to a specific Entity in an S&P knowledge base. It should not be interpreted as an estimated probability. However, it can be used to rank multiple matches for a record.
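For illustration, a minimal Record and a fuller Record might look like the following Python dictionaries, and multiple Entity Links returned for one Record can be ranked by their Link Scores. This is a sketch only; the field values are drawn from the examples later on this page.

# A minimal Record: only uid and name are required.
minimal_record = {"uid": "1", "name": "Kensho"}

# A fuller Record: every field beyond uid is an extra feature for the model.
full_record = {
    "uid": "2",
    "name": "S&P Global",
    "aliases": ["S&P", "SPGI"],
    "country_iso3": "USA",
    "city": "New York City",
    "url": "www.spglobal.com",
}

# When several Entity Links are returned for one Record, the Link Score ranks them.
links = [
    {"sp_company_id": "21719", "link_score": 0.99},
    {"sp_company_id": "7642076", "link_score": 0.22},
]
best_link = max(links, key=lambda link: link["link_score"])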
Link File
The Link File endpoint supports requests for linking a .csv file of properly formatted company information.
Sample Usage
Input File
Valid fields to be included in an input .csv file are as follows:
- uid: str [required]
- name: str [required]
- aliases: Pipe-separated str
- address or (address_1, address_2): str
- city: str
- state or state_abbreviation: str
- zipcode: str
- country or country_iso3 or country_iso2: str
- phone_number: str
- url: str
- year_founded: int
The following is a sample input file:
uid,name,aliases,address,city,state_abbreviation,zipcode,country_iso3
1,Kensho Technologies,Kensho,44 Brattle St,Cambridge,MA,,USA
2,S&P Global,,55 Water St,New York,NY,10038,
3,Ratings,,,,,,USA
4,Apple,,,Cupertino,CA,,
5,Aramco,Saudi Aramco|Saudi Arabian Oil Company,,,,,SAU
6,Tencent,,,,,,CHN
7,Samsung,,,,,,KOR
8,Barclays Egypt,,,,,,
9,Roche,,,,,,
10,Toyota,,1 Toyota-Cho,Toyota City,,,JPN
Request
The following code snippet shows how to submit a file linking request and download its output with Kensho's file linking client.
from datetime import datetime, timedelta
import logging
import io
import os
import tempfile
import time
from typing import Any, Dict, List, Tuple, Optional
import zipfile

import requests

logger = logging.getLogger(__name__)

# Polling statuses
JOB_SUCCESS = "SUCCESS"
JOB_ERROR = "ERROR"
POLLING_TIMEOUT = "TIMEOUT"

# Extra fields
NAME = "name"
ADDRESS = "address"
CITY = "city"
STATE = "state"
ZIPCODE = "zipcode"
COUNTRY = "country_iso3"
ALIASES = "aliases"
PHONE_NUMBER = "phone_number"
URL = "url"
YEAR_FOUNDED = "year_founded"
AVAILABLE_EXTRA_FIELDS = {
    NAME,
    ADDRESS,
    CITY,
    STATE,
    ZIPCODE,
    COUNTRY,
    ALIASES,
    PHONE_NUMBER,
    URL,
    YEAR_FOUNDED,
}

# Knowledge bases
MI = "mi"
CAPIQ = "capiq"
AVAILABLE_KNOWLEDGE_BASES = [MI, CAPIQ]


# Helper function to upload a file to AWS S3
def _upload_file(path_to_file: str, aws_info: Dict[str, Any]) -> None:
    with tempfile.NamedTemporaryFile(mode="w+b") as file_out:
        with zipfile.ZipFile(file_out, mode="w") as file_out_zip:
            file_out_zip.write(path_to_file)
        file_out.flush()
        file_out.seek(0)
        files = {"file": file_out}
        requests.post(aws_info["url"], data=aws_info["fields"], files=files)


# Helper function to download the output file
def _download_output(response: requests.Response, output_directory_path: str) -> None:
    with zipfile.ZipFile(io.BytesIO(response.content)) as zfile:
        csv_filepath = zfile.namelist()[0]
        with zfile.open(csv_filepath) as csv_file:
            csv_filename = os.path.basename(csv_filepath)
            download_location = os.path.join(output_directory_path, csv_filename)
            csv_content = csv_file.read()
            with open(download_location, "wb") as f:
                f.write(csv_content)


# Helper function to catch and re-raise HTTP errors
def _catch_reraise_error(response: requests.models.Response) -> None:
    try:
        response.raise_for_status()
    except requests.HTTPError as e:
        if response.text:
            raise requests.HTTPError(f"{e.args[0]}. Error Message: {response.text}") from None
        else:
            raise e from None


class LinkFileClient:
    """A class to call the Link File API that automatically refreshes tokens when needed."""

    def __init__(self, link_host: str, refresh_token: str, access_token: Optional[str] = None):
        self._link_host = link_host
        self._refresh_token = refresh_token
        self.access_token = access_token
        logger.setLevel(logging.INFO)

    def update_access_token(self):
        response = requests.get(
            f"{self._link_host}/oauth2/refresh?refresh_token={self._refresh_token}"
        )
        if not response.ok:
            raise RuntimeError(
                "Something went wrong when trying to use Link. Is your refresh token correct?"
            )
        access_token = response.json()["access_token"]
        self.access_token = access_token

    def call_api(self, verb, *args, headers={}, **kwargs):
        """Call Link API, refreshing access token as needed."""
        if not self.access_token:
            self.update_access_token()

        def call_with_updated_headers(request_method):
            headers["Authorization"] = f"Bearer {self.access_token}"
            return request_method(*args, headers=headers, **kwargs)

        method = getattr(requests, verb)
        response = call_with_updated_headers(method)
        if response.status_code == 403:
            self.update_access_token()
            response = call_with_updated_headers(method)
        return response

    def start_job(
        self,
        job_name: str,
        path_to_file: str,
        knowledge_base: str,
        extra_fields: Optional[List[str]] = None,
        country_filter: bool = False,
        num_top_records: int = 1,
        include_identifiers: Optional[List[str]] = None,
        get_ultimate_parent: bool = False,
        include_input_fields: bool = False,
        include_score_label: bool = False,
    ) -> Optional[str]:
        """Start a new linking job.

        Args:
            job_name: Name given to the job.
            path_to_file: Path to the input csv file.
            extra_fields: Extra response fields to output.
            country_filter: Whether to filter by a given country.
            knowledge_base: Name of knowledge base to link against.
            num_top_records: Maximum number of top records for each link.
            include_identifiers: Cross-reference ID types to output.
            get_ultimate_parent: Whether to fetch the Ultimate Parent Company.
            include_input_fields: Whether to include input fields in output file.

        Returns:
            UUID of the job if it is started successfully otherwise None.
        """
        extra_fields = extra_fields or []
        if not set(extra_fields).issubset(AVAILABLE_EXTRA_FIELDS):
            raise ValueError(
                f"Extra fields {set(extra_fields)-AVAILABLE_EXTRA_FIELDS} not allowed."
            )
        if knowledge_base not in AVAILABLE_KNOWLEDGE_BASES:
            raise ValueError(
                "Invalid knowledge base. "
                f"Knowledge base should be one of {AVAILABLE_KNOWLEDGE_BASES}"
            )
        file_name = os.path.basename(path_to_file)
        create_job_request_data = {
            "job_name": job_name,
            "file_name": file_name,
            "knowledge_base": knowledge_base,
            "model_name": "generic",
            "num_top_records": num_top_records,
            "country_filter": country_filter,
            "extra_fields": extra_fields,
            "include_identifiers": include_identifiers,
            "get_ultimate_parent": get_ultimate_parent,
            "include_input_fields": include_input_fields,
        }
        logger.info("Creating job with following params: %s", str(create_job_request_data))
        create_response = self.call_api(
            "post",
            f"{self._link_host}/api/linkfile/v0/create",
            json=create_job_request_data,
            headers={"Content-Type": "application/json"},
        )
        _catch_reraise_error(create_response)
        job_info = create_response.json()
        job_id = job_info["job_id"]
        logger.info("Uploading file: %s", path_to_file)
        aws_info = job_info["aws_upload_url"]
        _upload_file(path_to_file, aws_info)
        logger.info("Starting linking job with id: %s", job_id)
        start_response = self.call_api(
            "get",
            f"{self._link_host}/api/linkfile/v0/start/{job_id}",
            headers={"Content-Type": "application/json"},
        )
        _catch_reraise_error(start_response)
        return job_id

    def get_status(self, job_id: str) -> Dict[str, Any]:
        """Get status of the job.

        Args:
            job_id: UUID of the job.

        Returns:
            A dict containing the current status as well as other information about the job.
        """
        response = self.call_api(
            "get",
            f"{self._link_host}/api/linkfile/v0/status/job/{job_id}",
            headers={"Content-Type": "application/json"},
        )
        _catch_reraise_error(response)
        return response.json()

    def poll_for_completion(
        self, job_id: str, interval: float = 30.0, timeout: Optional[float] = None
    ) -> Tuple[str, str]:
        """Poll the server until the job is complete.

        Args:
            job_id: UUID of the job
            interval: Polling interval
            timeout: Timeout for when the polling loop exits.
        """
        poll_timeout_time = datetime.now() + timedelta(seconds=timeout) if timeout else None
        while True:
            status = self.get_status(job_id)
            logger.info("%s Current status: %s", datetime.now(), status["status"])
            if status["status"] == JOB_SUCCESS:
                logger.info("Job completed successfully. Output file is ready to be downloaded.")
                return JOB_SUCCESS, None
            elif status["status"] == JOB_ERROR:
                logger.info("Job failed.")
                return JOB_ERROR, status["message"]
            time.sleep(interval)
            if poll_timeout_time is not None and datetime.now() >= poll_timeout_time:
                logger.info("Polling timed out")
                return POLLING_TIMEOUT, None

    def download_job_output(self, job_id: str, output_directory_path: str) -> None:
        """Download the output of the linking job.

        Args:
            job_id: UUID of the job.
            output_directory_path: Path to the directory where the output file should be downloaded.
        """
        download_response = self.call_api(
            "get",
            f"{self._link_host}/api/linkfile/v0/download-url/{job_id}",
            headers={"Content-Type": "application/json"},
        )
        _catch_reraise_error(download_response)
        download_response_json = download_response.json()
        presigned_url = download_response_json["file_path"]
        response = requests.get(presigned_url)
        _download_output(response, output_directory_path)
        logger.info("Output file has been successfully downloaded.")

    def ping_server(self) -> None:
        """Ping the server to test the readiness of the server."""
        response = requests.get(f"{self._link_host}/statusz")
        response.raise_for_status()
        logger.info("Link server at %s is ready to receive requests.", self._link_host)


# Endpoint and auth info
LINK_HOST = "https://api.link.kensho.com"
ACCESS_TOKEN = "<token obtained from login>"
REFRESH_TOKEN = "<token obtained from login>"

# Input file and output directory paths
input_file_path = "path_to_input_csv_file"
output_directory_path = "path_to_output_directory"

logging.basicConfig(level=logging.INFO)
client = LinkFileClient(link_host=LINK_HOST, refresh_token=REFRESH_TOKEN, access_token=ACCESS_TOKEN)
job_id = client.start_job(
    "Test Job",
    input_file_path,
    CAPIQ,
    extra_fields=[NAME],
    country_filter=False,
    num_top_records=1,
    include_identifiers=[],
    get_ultimate_parent=False,
    include_input_fields=False,
)
final_status, error_msg = client.poll_for_completion(job_id)
if final_status == JOB_SUCCESS:
    client.download_job_output(job_id, output_directory_path)
elif final_status == JOB_ERROR:
    logger.info("Error: %s", error_msg)
Output File
Below is the expected output file, given the sample file above as input and a file linking job started as in the code snippet above.
uid,sp_company_id,link_score,input_name,name
1,251994106,99.96,Kensho Technologies,"Kensho Technologies, Inc."
2,21719,99.41,S&P Global,S&P Global Inc.
3,7642076,22.04,Ratings,S&P Global Ratings Inc.
4,24937,99.69,Apple,Apple Inc.
5,1241120,99.07,Aramco,Saudi Arabian Oil Company
6,11042136,99.77,Tencent,Tencent Holdings Limited
7,91868,98.44,Samsung,"Samsung Electronics Co., Ltd."
8,13401047,97.14,Barclays Egypt,Attijariwafa bank Egypt S.A.E.
9,687140,99.39,Roche,Roche Holding AG
10,319676,98.65,Toyota,Toyota Motor Corporation
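The uid column ties each output row back to the corresponding input record. As a minimal sketch (assuming the job was run with num_top_records=1 and the output file was downloaded to a hypothetical path output.csv), the file can be read back and joined on uid like this:

import csv

# Hypothetical location of the downloaded output file; adjust to the actual download path.
output_csv_path = "path_to_output_directory/output.csv"

# Map each input uid to its linked S&P company ID and Link Score.
links_by_uid = {}
with open(output_csv_path, newline="") as f:
    for row in csv.DictReader(f):
        links_by_uid[row["uid"]] = (row["sp_company_id"], float(row["link_score"]))

print(links_by_uid.get("1"))  # e.g. ('251994106', 99.96) for Kensho Technologies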
On-Demand
The following endpoint can be used for matching company data on-demand:
/api/ondemand/v0/companies/<knowledge_base>/<model_name>
This on-demand endpoint requires the user to provide a knowledge base and model name in the URI.
knowledge_base is either "mi" or "capiq".
- "mi" allows the user to link against the Market Intelligence dataset.
- "capiq" allows the user to link against the CapitalIQ dataset.
model_name (at this time) must be "generic".
A JSON request body should be provided with the following format:
{"num_top_records": Optional[int] = 1,"records": [{"uid": str,"name": str,"aliases": Optional[List[str]] = [],"country_iso3": Optional[str] = None,"address": Optional[str] = None,"state": Optional[str] = None,"city": Optional[str] = None,"zipcode": Optional[str] = None,"year_founded": Optional[int] = None,"url": Optional[str] = None,"phone_number": Optional[str] = None,}]}
Note: Unlike the link file endpoint, the on-demand endpoint only accepts the country_iso3 field. It will ignore any country or country_iso2 fields.
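If your data holds two-letter codes or full country names, they need to be converted to ISO-3 codes before building an on-demand request. One possible approach, sketched here with the third-party pycountry package (not part of the Link API), is:

import pycountry  # third-party package; install separately


def to_iso3(country: str) -> str:
    """Convert a country name or ISO-2 code to the ISO-3 code expected by the on-demand endpoint."""
    # lookup() accepts names as well as ISO-2 and ISO-3 codes; it raises LookupError if nothing matches.
    return pycountry.countries.lookup(country).alpha_3


record = {"uid": "1", "name": "Kensho", "country_iso3": to_iso3("US")}  # -> "USA"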
Sample Usage
Request
The following example links companies to the CapitalIQ dataset with the "generic" model.
The URI indicates the knowledge base (capiq) and model name (generic):
https://api.link.kensho.com/api/ondemand/v0/companies/capiq/generic
The JSON body below requests links for the company "S&P Global". num_top_records indicates how many records to return for each company. All possible fields are provided for this record, although only the required fields (name and uid) are needed.
{"records": [{"uid": "1","name": "S&P Global","aliases": ["S&P", "SPGI"],"country_iso3": "USA","address": "55 Water Street","state": "New York","city": "New York City","zipcode": "10041","year_founded": "1860","url": "www.spglobal.com","phone_number": "(212) 438-1000"}],"include_response_fields": ["name"],"num_top_records": 1}
Response
The output of the endpoint contains metadata that describes the entity type, knowledge base, model name and model version used for the request.
The output for each linked record contains the S&P knowledge base ID and the associated link score. The results for each record are returned in descending order of the link_score field.
{"entity_type": "companies","knowledge_base": "capiq","model_name": "generic","records": [{"input_name": "S&P Global","links": [{"name": "S&P Global Inc.","sp_company_id": "21719","link_score": 0.991182643610278,}],"num_links": 1,"uid": "1"}]}
The following is an example of how to interact with the API using Python:
import requests

LINK_URL = 'https://api.link.kensho.com/api/ondemand/v0/companies/capiq/generic'

request_json = {
    "records": [
        {
            "uid": "1",
            "name": "S&P Global",
            "aliases": ["S&P", "SPGI"],
            "country_iso3": "USA",
            "address": "55 Water Street",
            "state": "New York",
            "city": "New York City",
            "zipcode": "10041",
            "year_founded": 1860,
            "url": "www.spglobal.com",
            "phone_number": "(212) 438-1000"
        }
    ],
    "include_response_fields": ["name"],
    "num_top_records": 1
}

response = requests.post(
    LINK_URL,
    json=request_json,
    headers={
        'Content-Type': 'application/json',
        'Authorization': 'Bearer <token obtained from login>'
    }
)
linked_results = response.json()
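Continuing from the snippet above, the parsed response can then be consumed record by record. This is only a sketch; the field names follow the sample response shown earlier.

# Pick the best link for each input record; links are already sorted by descending link_score.
for record in linked_results["records"]:
    if record["num_links"] == 0:
        print(f"No link found for uid {record['uid']} ({record['input_name']})")
        continue
    best = record["links"][0]
    print(record["uid"], record["input_name"], "->", best["sp_company_id"], best["link_score"])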
BECRS Cross-reference ID Lookup
If the Requestor has BECRS entitlements, they can add include_identifiers to the request. Link will perform the cross-reference lookup through BECRS and add the results to the response. If include_identifiers is added to the request, the ultimate parent information for each company may also be fetched. This can be toggled through the get_ultimate_parent boolean field.
The following example request asks for the DUNS and SNL identifier types as well as the ultimate parent company info:
{"num_top_records": 1,"records": [{"uid": "1","name": "Kensho"}],"include_response_fields": ["name"],"include_identifiers": ["DUNS", "SNL"],"get_ultimate_parent": true}
The corresponding response will have fields for the requested identifiers added:
{"entity_type": "companies","knowledge_base": "capiq","model_name": "generic","records": [{"input_name": "Kensho","links": [{"name": "Kensho Technologies, Inc.","sp_company_id": "251994106","link_score": 0.9933911614820168,"DUNS": ["079246675"],"SNL": ["5269941"],"ultimate_parent_company": {"name": "S&P Global Inc.","id": "21719"}}],"num_links": 1,"uid": "1"}]}
Note: The cross-reference ID fields are lists of IDs since BECRS can return multiple cross-reference IDs for the same knowledge base ID. In addition, BECRS cross-reference lookups are currently only available when linking with the CapIQ knowledge base.
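Because the cross-reference fields are lists, code that flattens a BECRS-enriched response into a table needs to handle empty or multi-valued lists. A sketch, assuming becrs_results holds a parsed response like the one above:

def first_or_none(values):
    """Return the first cross-reference ID if BECRS returned any, otherwise None."""
    return values[0] if values else None


for record in becrs_results["records"]:
    for link in record["links"]:
        duns = first_or_none(link.get("DUNS", []))
        snl = first_or_none(link.get("SNL", []))
        parent = link.get("ultimate_parent_company", {}).get("name")
        print(link["sp_company_id"], duns, snl, parent)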
Converting On-Demand input to Link File input
On-Demand is limited to 100 records in a single request. There are plans to support a larger number of records in the future. For now, the following snippet can be used to convert the JSON body of an On-Demand request into an input file usable by the File Link endpoint.
import csv
import json
import uuid
from pathlib import Path


def write_ondemand_body_to_csv(json_body: str, output_directory: str) -> str:
    """Converts the JSON for an on-demand request to a file usable with the File Linking Endpoint

    Args:
        json_body: the POST body of the on-demand endpoint
        output_directory: the directory to write the output file to

    Returns:
        the concrete path for the file that was created
    """
    raw_json = json.loads(json_body)
    filename = f"records-{uuid.uuid4()}.csv"
    concrete_path = Path(output_directory) / filename
    fieldnames = (
        "uid", "name", "aliases", "country_iso3", "address", "state", "city",
        "zipcode", "year_founded", "url", "phone_number",
    )
    with concrete_path.open('w', newline='') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
        for record in raw_json['records']:
            if "aliases" in record:
                # The file format expects pipe-separated aliases rather than a JSON list.
                record["aliases"] = "|".join(record["aliases"])
            writer.writerow(record)
    return str(concrete_path)
Note that the snippet does not check that the directory exists or that the JSON is formatted correctly.
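As a sketch of how the pieces fit together, the helper above can feed the LinkFileClient from the Link File section when a request grows beyond 100 records (the request body and paths here are placeholders):

# Hypothetical on-demand-style body; in practice this would hold more than 100 records.
large_request_body = '{"records": [{"uid": "1", "name": "Kensho"}]}'

# Convert the JSON body to a CSV and submit it as a file linking job instead.
input_csv_path = write_ondemand_body_to_csv(large_request_body, "path_to_output_directory")
client = LinkFileClient(link_host=LINK_HOST, refresh_token=REFRESH_TOKEN)
job_id = client.start_job("Converted On-Demand Job", input_csv_path, CAPIQ)
final_status, error_msg = client.poll_for_completion(job_id)
if final_status == JOB_SUCCESS:
    client.download_job_output(job_id, "path_to_output_directory")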