Extract. Transform. Load.

Extract

To extract the data we needed, we first created a separate scraper script with BeautifulSoup for each website we wanted to analyze.
The websites we scraped were Monster.com, Glassdoor.com, Indeed.com, and CareerBuilder.com.

Each data extractor pulled the company name, job title, salary information, location, job description, keywords, the site the posting came from, and finally the URL of the job posting itself.
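
As a rough sketch of what one of these extractors looked like (the CSS selectors here are hypothetical, since each site uses its own markup and it changes over time):

```python
import requests
from bs4 import BeautifulSoup

def scrape_posting(url: str) -> dict:
    """Pull the fields we stored for a single job posting."""
    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
    soup = BeautifulSoup(response.text, "html.parser")

    def text_or_none(selector: str):
        tag = soup.select_one(selector)
        return tag.get_text(strip=True) if tag else None

    return {
        "company": text_or_none(".company-name"),        # hypothetical selector
        "title": text_or_none(".job-title"),             # hypothetical selector
        "salary": text_or_none(".salary-estimate"),      # hypothetical selector
        "location": text_or_none(".job-location"),       # hypothetical selector
        "description": text_or_none(".job-description"), # hypothetical selector
        "url": url,
    }
```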

For the keywords, we compiled the common terms used in our class's curriculum and then searched for each one across the scraped job descriptions.
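
A minimal sketch of that tagging step; the keyword list below is illustrative rather than the exact set from our curriculum:

```python
# Illustrative keyword list; the real set came from our class curriculum.
KEYWORDS = ["python", "sql", "pandas", "mongodb", "tableau", "machine learning"]

def find_keywords(description: str) -> list:
    """Return every curriculum keyword that appears in a job description."""
    text = description.lower()
    return [kw for kw in KEYWORDS if kw in text]

# e.g. find_keywords("Seeking an analyst with Python and SQL experience")
# -> ["python", "sql"]
```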

Transform

We first stored the results from the four sites in four separate Jupyter notebooks.
Before joining the results, we loaded each site's data into a pandas DataFrame and dropped duplicate rows so that each unique job posting appeared only once.
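
A sketch of the per-site deduplication, with illustrative records and an assumed choice of columns for identifying a duplicate:

```python
import pandas as pd

# Illustrative records; in the project these came from the scrapers.
records = [
    {"company": "Acme", "title": "Data Analyst", "location": "Austin, TX"},
    {"company": "Acme", "title": "Data Analyst", "location": "Austin, TX"},
]

df = pd.DataFrame(records)

# Treat postings with the same company, title, and location as duplicates
# (this subset of columns is an assumption for illustration).
df = df.drop_duplicates(subset=["company", "title", "location"])
df = df.reset_index(drop=True)
```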
Once that was finished, we added a column to each DataFrame indicating which site the data came from.
We then concatenated all four DataFrames and dropped duplicates again. Next we normalized the individual columns: for location, we extracted just the city and state; for salary, we stripped out all formatting and extraneous text, keeping only the number.
Finally, we filled all empty cells with "NaN".
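
Putting the transform steps together, here is a sketch of the merge-and-clean pipeline; the DataFrame names, duplicate-detection columns, and regular expressions are assumptions for illustration:

```python
import pandas as pd

# Assumed names for the four per-site DataFrames from the notebooks.
frames = {"monster": monster_df, "glassdoor": glassdoor_df,
          "indeed": indeed_df, "careerbuilder": careerbuilder_df}

# Tag each frame with its source site before combining.
for site, frame in frames.items():
    frame["site"] = site

combined = pd.concat(frames.values(), ignore_index=True)

# Drop cross-site duplicates (subset of columns is illustrative).
combined = combined.drop_duplicates(subset=["company", "title", "location"])

# Location: keep just "City, ST".
combined["location"] = combined["location"].str.extract(
    r"([^,]+,\s*[A-Z]{2})", expand=False)

# Salary: strip commas and surrounding text, keeping only the first number.
combined["salary"] = (combined["salary"]
                      .str.replace(",", "", regex=False)
                      .str.extract(r"(\d[\d.]*)", expand=False)
                      .astype(float))

# Fill remaining empty cells with the string "NaN", as described above.
combined = combined.fillna("NaN")
```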

Load

Once we finished cleaning our data, we loaded it into MongoDB for future use.
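
A minimal load step with pymongo, continuing from the `combined` DataFrame in the transform sketch above; the connection string and the database and collection names are assumptions, not necessarily the ones we used:

```python
from pymongo import MongoClient

# Assumed local connection and naming.
client = MongoClient("mongodb://localhost:27017/")
collection = client["etl_project"]["job_postings"]

# insert_many takes a list of dicts, so convert the cleaned DataFrame first.
collection.insert_many(combined.to_dict("records"))
```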