API Technical Specs

Learn how to configure and leverage our services to achieve your toughest recruiting needs.

This documentation is for Version 10 of the Sovren REST API, released on December 15, 2020. V9 and V10 use the same parsing and matching engines under the hood, but V10 is more streamlined and has much simpler output. Please visit this link for an in-depth comparison.

Resume Parser Overview

The Sovren Resume/CV Parser takes in documents and returns structured JSON responses representing a human understanding of the data. Your integration task is not just to get the API call to succeed, but to understand the bigger picture and how to configure each transaction properly. For example, different configurations are needed for processing batches of resumes, resumes from college students, or resumes coming from Australia or New Zealand. Below we discuss the most important points to understand for your integration.

To parse a resume accurately, you must tell us when that resume was written or submitted to your system. This is not obvious, but it is 100x more important than any other setting when parsing resumes. We cannot determine that date from the resume. You must specify it explicitly. We refer to this date as the Document Last Modified Date.

Document Last Modified Date (formerly Revision Date)

Document Last Modified Dates are required for every transaction because they make or break the accuracy of parsing. The Document Last Modified Date tells the parser when the document was last revised, which determines how terms such as 'current' are interpreted. In a candidate upload scenario, it's safe to assume the candidate is uploading a reasonably current version of their resume, so you can use today's date as the Document Last Modified Date. In any other situation, the resume was received sometime before today, and you must supply that earlier date.

If you leave this date off and parse a batch of 1 million resumes, your oldest and least employable candidates will be misrepresented as the most experienced, most employable, ready-to-go-to-work candidates, even though you received their resumes years ago. Let's look at an example below.


Molly Adams

(678) 555-1212
missadams@yahoo.com

930 Via Mil Cumbres Unit 119
Solana Beach, California 92075

Work History
Technical Difference    2014 - current
Senior Engineer
  • built company website in .NET
...

Sample Parsed Data Points
Field Name | Correct Last Modified Date (2015-01-01) | No Last Modified Date (parsed as today)
Months Experience for job title Senior Engineer | 12 | 113
Months Experience for skill .NET | 12 | 113
Last Used Date for skill .NET | 2014-01-01 | 2023-06-06
Total Months Experience | 12 | 113
Average Months Per Employer | 12 | 113

As the sample parsed data points show, without a Document Last Modified Date the parser can't calculate the metadata properly. It reports that the candidate has more than nine times the experience they actually have (113 months instead of 12) and that the experience is current. This type of mistake will pollute any searching or matching software and push these false positives high into the result set.
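The 12 and 113 month figures in the sample data follow from simple date arithmetic. The sketch below is only an illustration of that arithmetic, not the parser's actual algorithm; the `months_between` helper and the choice to read "2014" as 2014-01-01 are assumptions made for the example.

```python
from datetime import date


def months_between(start: date, end: date) -> int:
    """Whole calendar months from start to end (never negative)."""
    months = (end.year - start.year) * 12 + (end.month - start.month)
    if end.day < start.day:
        months -= 1  # the final month is not complete yet
    return max(months, 0)


# Assumption: "2014 - current" is read as starting on 2014-01-01.
job_start = date(2014, 1, 1)

# "current" resolved against the real Document Last Modified Date.
with_date = months_between(job_start, date(2015, 1, 1))

# "current" resolved against the day of parsing (2023-06-06 in the sample).
without_date = months_between(job_start, date(2023, 6, 6))

print(with_date, without_date)
```

The only input that changes between the two calls is the reference date, which is why that one setting has such a large effect on every derived field.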

How do I determine the correct Document Last Modified Date?

The most accurate Document Last Modified Date is the date the file was last authored. Since resumes can come from many different places, what to look for depends on the source. Here are a few use cases and our recommended approach for determining the correct date in each.

File upload control

When a user uploads a file directly from their file system, we recommend using the last modified date of the file. In a browser, this is available as File.lastModified, documented at https://developer.mozilla.org/en-US/docs/Web/API/File/lastModified.

Batch of resumes on disk

When you are processing a batch of resumes on disk, look at the last modified dates of the files and make sure that they are not all the same or within seconds of each other. If the dates are the same, the file metadata was overwritten at some point during file transfer and is no longer valid. You need to go back to the source and move those files over using an approach that preserves the original dates.
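A check like the one described above could be sketched as follows. This is a minimal illustration under stated assumptions: the function names and the 5-second window are hypothetical, and you should choose a threshold that suits your own transfer process.

```python
from datetime import date
from pathlib import Path


def last_modified_date(path: Path) -> date:
    """File system modified time of one file, as a local calendar date."""
    return date.fromtimestamp(path.stat().st_mtime)


def mtimes_look_overwritten(folder: str, window_seconds: float = 5.0) -> bool:
    """Return True when every file's modified time falls inside one short window.

    That pattern suggests the dates were reset during a copy or transfer
    and should not be used as Document Last Modified Dates.
    """
    mtimes = [p.stat().st_mtime for p in Path(folder).iterdir() if p.is_file()]
    if len(mtimes) < 2:
        return False  # nothing to compare against
    return max(mtimes) - min(mtimes) <= window_seconds
```

If `mtimes_look_overwritten` returns True for a folder, stop and re-transfer the files from the source rather than parsing them with those dates.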

Batch of resumes from a database

When you have a batch of resumes from a database, they are usually stored with a profile. If the file's modified date was stored in the database, use that. If not, look for a last modified date on the profile and use that instead.
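The fallback order for database records can be expressed in a few lines. This is a sketch only; `pick_last_modified` and its parameter names are hypothetical and stand in for whatever columns your own schema uses.

```python
from datetime import date
from typing import Optional


def pick_last_modified(
    file_modified: Optional[date],
    profile_modified: Optional[date],
) -> Optional[date]:
    """Prefer the stored file date; fall back to the profile date.

    Returns None when neither is known, so the caller can flag the record
    for review instead of silently defaulting to today's date.
    """
    if file_modified is not None:
        return file_modified
    return profile_modified
```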

Sourcing resumes from a third party such as a job board

When you receive resumes from a third-party API, the API response should include this date. If you don't see a date, reach out to the third party to clarify.

Do Not Repeatedly Retry Failed Parsing Transactions

You must institute processing safeguards and kill switches to ensure that you do not violate the Acceptable Use Policy (AUP). Without explicit programming checks that test whether a transaction failed and whether the document can legitimately be resubmitted under the AUP, you will eventually violate the AUP and probably use all your account credits, and more, in a "futile loop of Einsteinian doom". An example of such a loop is logic like this:

NEVER EVER IMPLEMENT THIS TYPE OF LOGIC
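// Anti-pattern: a document that fails once is resubmitted forever.
HttpStatusCode responseCode;
foreach (Document doc in batchOfDocuments) {
    do {
        responseCode = SendDocumentToServiceForProcessing(doc);
    } while (responseCode != HttpStatusCode.OK); // never exits for a bad document
}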

The problem with the above loop is that when a particular document fails, it is resubmitted for processing an infinite number of times, and there is no chance that it will magically succeed on the 765,498th attempt or any other.
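By contrast, a safer shape for the same batch sends each document exactly once, records failures for later review, and stops the whole batch when failures pile up. The sketch below assumes a hypothetical `send` callable that returns an HTTP status code, treats 200 as success, and uses an arbitrary threshold of five consecutive failures as the kill switch. Whether a failed document may be resubmitted at all is governed by the AUP and is not decided by this code.

```python
from typing import Callable, Iterable, List, Tuple


def process_batch_once(
    documents: Iterable[str],
    send: Callable[[str], int],
    max_consecutive_failures: int = 5,
) -> Tuple[List[str], List[str]]:
    """Submit each document a single time, one after another.

    Returns (succeeded, failed). Failed documents are never retried here.
    """
    succeeded: List[str] = []
    failed: List[str] = []
    consecutive_failures = 0

    for doc in documents:
        status = send(doc)  # exactly one attempt per document
        if status == 200:
            succeeded.append(doc)
            consecutive_failures = 0
        else:
            failed.append(doc)
            consecutive_failures += 1
            if consecutive_failures >= max_consecutive_failures:
                break  # kill switch: something systemic is wrong, stop the batch

    return succeeded, failed
```

Because the loop is a plain for-each, it also processes documents serially rather than concurrently.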

Batch Parsing Concurrency

When your application needs to parse a set, batch, or folder of selected resumes, you MUST parse them one at a time and never concurrently. Parsing one at a time allows you to process over 100,000 documents per day (30 million per year). NEVER EVER program concurrent parsing into software that you provide to end users.

The vast majority of programming errors that cause concurrency violations are due to one of these two integration errors:

  1. You allow recruiters to select resumes from a directory and then you parse those in a batch that sends parallel concurrent transactions rather than processing them serially one by one. There is no need to process such batches in parallel, as humans cannot read XX resumes per second. You MUST parse such resumes in a For-Each style of loop, one at a time, and never concurrently.
  2. You have a timed process that kicks off a batch at the same time(s) every day, and this batch processes transactions unnecessarily in parallel rather than serially one at a time. Again, there is no need to process XXX transactions per second rather than staying within the concurrency limit.

When you have a demonstrable, pressing business need to parse a huge number of documents in a short period of time (defined as more than 100,000 documents in less than 12 hours, with a valid reason why they must be parsed in that time frame), you must ensure that you never exceed your maximum allowed concurrent requests. Get this value by calling the Get Account Info endpoint and reading the MaximumConcurrentRequests field in the response. You must check this value before you start a parsing batch and again after every 1,000 transactions, and set your concurrent requests accordingly so that you do not violate our AUP. The value of MaximumConcurrentRequests may change up or down dynamically, so always call Get Account Info rather than assuming it has not changed.
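For the exceptional high-volume case only, the re-check rule might be structured like the sketch below. It is an illustration under assumptions: `parse_one` and `get_max_concurrent` are hypothetical callables that you would implement yourself, the latter by calling Get Account Info and returning the MaximumConcurrentRequests value from the response. Nothing here is an official client.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def parse_large_batch(
    documents: Sequence[str],
    parse_one: Callable[[str], int],
    get_max_concurrent: Callable[[], int],
    recheck_every: int = 1000,
) -> List[int]:
    """Parse documents in chunks, re-reading the concurrency limit per chunk.

    The limit is fetched before the batch starts and again after every
    `recheck_every` transactions, and the worker pool is never larger
    than the value most recently returned.
    """
    results: List[int] = []
    for start in range(0, len(documents), recheck_every):
        chunk = documents[start:start + recheck_every]
        limit = max(1, get_max_concurrent())  # the value may have changed
        with ThreadPoolExecutor(max_workers=limit) as pool:
            # map() keeps results in input order; leaving the with-block
            # waits for every request in this chunk to finish.
            results.extend(pool.map(parse_one, chunk))
    return results
```

Sizing a fresh pool for each chunk keeps the logic simple: a lowered limit takes effect at the next 1,000-transaction boundary without any shared state between chunks.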

All modern cloud systems implement and enforce such concurrency limits. See, for example, this discussion by Google: https://cloud.google.com/solutions/rate-limiting-strategies-techniques. If you have any questions about batch transactions, please reach out to support@sovren.com. We prefer helping you integrate correctly the first time rather than helping you fix it later!

How Does Resume Parsing Work?

What we call parsing is actually a multi-step process. First, we convert the source document to plain text, analyze it, and decide whether the text is usable for parsing. If the plain text is not usable, we immediately return a response indicating the issue. If the plain text is usable, it continues on to the parser, and the parsed document is returned in the response. The graphic below illustrates this workflow.

[Figure: Resume Parser Workflow]

The vast majority of problems in parsing come not from processing the plain text, but from the conversion to plain text. For example, documents can be corrupted in many ways, and the visual layout of a document often differs from the order in which its text is actually stored. The point is that when you find a mistake in the output, you should not assume it is a parsing mistake. Look at the converted text and check that it is as expected (that it reads logically). If the converted text is malformed, we cannot fix it.

Documents That Can Cause Problems

If you want to minimize conversion problems, don't use PDF documents. Many PDFs convert/parse fine; however, the reason for most of our "this document did not parse correctly" bug reports is that the document is a corrupt PDF file. PDF is a broken standard that often hides issues with the underlying text. If a PDF is corrupt, there is nothing Sovren can do to make that document convert to text "as a human would read it". More information regarding problems with the PDF format and how to check if a PDF is corrupt can be found here. Additionally, here are some tips for constructing an electronic resume.

Besides corrupt PDFs, we can predict - with very high accuracy - certain types of resumes that will not give satisfactory results.

LinkedIn Profile PDFs

While Sovren currently parses most LinkedIn profiles accurately, we cannot guarantee that we will always be able to do so. LinkedIn is determined to keep their data private by making their PDF profiles not compatible with any parsing software (Sovren and our competitors included). We are constantly working to adapt our parsing algorithms to the various changes LinkedIn makes regularly. It is our prediction that, at some point in the future, LinkedIn will make it impossible to extract any useful information from their profiles. We strongly advise our customers to avoid relying on LinkedIn profiles whenever possible.

Artists & Graphic Designers

The goal of these resumes is to be as visually unique as possible, showcasing the candidate's skills as a designer. This prevents accurate text extraction because candidates use images instead of text, run text diagonally across the page, use vertical text, and so on. Parsing can only be as accurate as the text extracted from the source document.

Extremely Long CVs Typical in Academia & Medicine

These documents usually run to tens of pages and are flooded with patents, publications, and speaking events. They describe work experience in very uncommon ways, and because that work experience is often at a school or university, it is easily confused with education.

Images & OCR

We don't provide Optical Character Recognition (OCR) because it introduces too many errors to allow parsing to be accurate.

Since Sovren supports text-based formats, you can use an OCR provider and send the plain text output to Sovren to be parsed.

Entry Level

The Parser assumes that all resumes contain Employment History and Education. When confronted with a resume that seems to be missing Employment History or Education, it will assume that it has made a mistake, missed that data, and will try to treat other data as Employment History or Education.

Although that's a good strategy in general, it fails for resumes from students, new graduates, and under-educated workers, where it is probable that the resume really does not contain any Employment History (and perhaps no Education). Therefore, when parsing a resume from a student, a recent graduate, or a worker with no advanced education (i.e., not even high school), set Coverage.EntryLevel = true in the config string (the default is false). This tells the Parser that it's acceptable to not find Employment History. Use it only for these resumes, where it results in more accurate parsing.
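As a rough sketch, a request body carrying both the Document Last Modified Date and the entry-level setting might be assembled as shown below. The config string `Coverage.EntryLevel = true` comes from the text above. The JSON field names (`DocumentAsBase64String`, `DocumentLastModified`, `Configuration`) and the `build_parse_request` helper are assumptions made for illustration, so confirm the exact names against the Parse endpoint reference before relying on them.

```python
import base64
import json
from datetime import date


def build_parse_request(
    document_bytes: bytes,
    last_modified: date,
    entry_level: bool = False,
) -> str:
    """Build a JSON request body for a single parse transaction.

    Field names are assumptions for illustration, not confirmed by this page.
    """
    body = {
        "DocumentAsBase64String": base64.b64encode(document_bytes).decode("ascii"),
        "DocumentLastModified": last_modified.isoformat(),  # YYYY-MM-DD
    }
    if entry_level:
        # Only for student, recent graduate, or similar resumes.
        body["Configuration"] = "Coverage.EntryLevel = true"
    return json.dumps(body)
```

Leaving the configuration out when `entry_level` is false keeps the default behavior, in which the Parser expects to find Employment History and Education.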

Australia / New Zealand / South Africa

Resumes from Australia, New Zealand, and South Africa can present challenges in special cases where the resume is written in English and contains contact information with an address in a country that uses 4-digit postal codes. More information on this topic can be found in the Languages section.