JSALT: Summer School and Getting Oriented

Every year, Johns Hopkins University holds the Frederick Jelinek Memorial Summer Workshop (about page here). This human language technologies (HLT) workshop has taken place almost every year since the mid-90s. This year, 2019, is the sixth year it has been held in honor of Frederick Jelinek. It’s abbreviated JSALT – which, I have to say, doesn’t make much sense to me. “Jelinek Summer Annual (workshop) on Language Technologies” is the best I can come up with to reconcile the full name and the abbreviation.

This year, it is being held in Montreal. I couldn’t imagine a better place to spend the summer. The workshop usually takes place at a university in the U.S., but this year it’s hosted by École de Technologie Supérieure (ETS). One of the organizers did his Ph.D. at ETS and took advantage of his contacts there to make JSALT happen in Montreal. Last December, the Hopkins organizers accepted proposals from professors all over the world who had an idea for a research direction to pursue during the workshop. The five following projects were selected:


Hack&Roll 2019

Before leaving for Singapore, I was looking up some events that would be happening in the city in the first few weeks I would be here. I came across a hackathon, which piqued my interest. To my surprise (and delight), it was to be held at NUS where I would be studying. I registered for the event (Hack and Roll 2019), an easy process as I could claim student-ship at the hosting university. I wasn’t completely set on doing it but wanted to keep my options open for Things-To-Do upon arrival.

Unfortunately, none of the other exchangers had seen the event. Many were interested in participating when I told them about it, but by then it was too late to sign up. Because teammates can be – and generally are – the defining aspect of the experience at a hackathon, I hesitated to work with people I didn’t already know. The hackathon organizers made a Slack channel for people to post about themselves and their skills, so I decided to go ahead and see if I could find people with whom I would work well. I didn’t find anyone at first but eventually teamed up with two students studying computer science at NUS: one fourth-year student and one second-year.

We ended up making a web app that provides a set of job recommendations based on the user’s resume. The advantage of JobMatcher (as we so creatively called it) is that it centralizes job postings from multiple sources, although our prototype only included listings from Indeed. The main technical challenge was coming up with a sorting algorithm that somehow prioritizes the collected jobs. Say, for example, the scraping module gathered 100 jobs from searches based on keywords from the user’s resume; how are we to present the jobs to the user? What makes a job “most relevant” to a user? Although we could have spent the whole hackathon thinking about this issue, we just decided to sort according to the number of matched keywords between the search results and the user’s resume. This solution is pretty much as brute-force and crude as you can come up with, but it was easy to implement and actually resulted in jobs that (at least for us) were quite relevant.
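The keyword-count sort described above can be sketched in a few lines of Python. This is a minimal illustration, not the code from the project: the function names (tokenize, rank_jobs), the job dictionaries, and the sample data are all assumptions made up for the example.

```python
import re

def tokenize(text):
    """Lowercase the text and return its set of word-like tokens."""
    # '+' and '#' are kept so that terms like "c++" or "c#" survive
    return set(re.findall(r"[a-z0-9+#]+", text.lower()))

def rank_jobs(resume_keywords, jobs):
    """Sort jobs by how many resume keywords appear in each description."""
    keywords = {k.lower() for k in resume_keywords}

    def score(job):
        # Number of distinct resume keywords found in this job's description
        return len(keywords & tokenize(job["description"]))

    # Highest score first; ties keep their original (scraped) order
    return sorted(jobs, key=score, reverse=True)

# Hypothetical scraped listings
jobs = [
    {"title": "Barista", "description": "Make coffee and serve customers"},
    {"title": "Backend Developer", "description": "Python, Flask and SQL experience"},
    {"title": "Data Analyst", "description": "SQL and Excel reporting"},
]
ranked = rank_jobs(["Python", "Flask", "SQL"], jobs)
print([job["title"] for job in ranked])
```

Counting distinct matches (rather than every occurrence) keeps a listing that repeats one keyword many times from outranking a listing that matches several different keywords.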

We had originally wanted the user to provide access to their LinkedIn account as input (rather than a resume). LinkedIn provides much more structured data than resumes, making it easier to process. I actually spent the majority of my time working with the LinkedIn API to implement authentication. Although I learned quite a bit (mostly about OAuth authentication), we didn’t actually use LinkedIn as the source of user career information because of privacy restrictions. In order to gain full access to a user’s account, we would have had to be registered and verified as LinkedIn developers. Because we weren’t, we only had access to the user’s name, email, and tag line … clearly not enough career info to predict which job listings would fit them.
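For readers unfamiliar with OAuth, the first step of the standard OAuth 2.0 authorization-code flow is to send the user to the provider’s login page with the app’s ID attached. The sketch below only builds that URL. It is a generic illustration and not the project’s code: the endpoint, client ID, redirect URI, and scope name are placeholders, and the real values come from the provider’s developer documentation.

```python
import secrets
from urllib.parse import urlencode, urlparse, parse_qs

def build_authorization_url(authorize_endpoint, client_id, redirect_uri, scope):
    """Return (url, state) for the first step of an OAuth 2.0 code flow."""
    # Random value echoed back by the provider; checking it guards against CSRF
    state = secrets.token_urlsafe(16)
    params = {
        "response_type": "code",   # ask for an authorization code
        "client_id": client_id,    # the application ID issued by the provider
        "redirect_uri": redirect_uri,
        "scope": scope,
        "state": state,
    }
    return f"{authorize_endpoint}?{urlencode(params)}", state

# All values below are placeholders, not real LinkedIn settings
url, state = build_authorization_url(
    "https://auth.example.com/oauth/authorize",
    "my-app-id",
    "http://localhost:5000/callback",
    "basic_profile",
)
query = parse_qs(urlparse(url).query)
print(query["client_id"], query["response_type"])
```

After the user logs in, the provider redirects back to the redirect URI with a code, which the server exchanges (together with the secret key) for an access token. The scopes granted to that token decide which profile fields are readable, which is where the limit on non-verified developers comes in.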

Here’s the project on Devpost and the repo on GitHub.

I logged some of my thoughts throughout the hackathon, which are listed below. I didn’t start writing down notes until we were a bit into the process so the picture is not quite complete but it’s still representative of the ups and downs we experienced in developing the app.


16.32 First task after deciding what we actually want to work on: looking up how to use GCloud with a Flask backend. We’ve discussed using Google App Engine to deploy our app (rather than Heroku).

18.00 After about 2 hours of debugging: GCloud Talent Solution is in fact not what we need. It’s meant for building a job portal, for example, where you add your own jobs and then search for them. Big waste of time.

19.27 Post-dinner. We decided to continue with the project despite encountering significant setbacks. I am starting work on extracting key info from LinkedIn profiles. I first create an app on their website, and receive an application ID and secret key with which I can make requests on behalf of the user.

22.19 I just got the profile authentication and information in JSON form. Now, a user can log in on our page. Subsequently, we will be able to retrieve information from their profile. We plan to use their top 3 skills listed on their LinkedIn profile.

23.30 We just realized: LinkedIn doesn’t allow full profile access to non-registered developers. It’s a matter of privacy – they don’t allow third parties to access the complete profile.

02.03 We are trying to get around not having access to the profile. There are a few “alternative” solutions, like a webcrawler that could go to the profile and search through the HTML for relevant information (past job titles, skills).

02.57 The webcrawlers were to no avail, so we decided to just use a PDF text extractor to get information from a user’s resume. Surprisingly, there isn’t an obvious choice for a Python package to convert a PDF to text. PyPDF, for example, just returns a file containing large amounts of whitespace. We aren’t able to install other libraries, like PDFtoText, that should make life easy.

03.36 Even after more searching, more debugging, and help from a hackathon organizer, we weren’t able to find a Python module to extract text. We are now looking into front-end modules written in JS that would serve the same purpose.

04.45 I ended up going back to the Python modules with an ugly implementation because I feel like we don’t have any other choice.
Got the text extractor to work, and am able to write the text to a file. We are now looking at how to extract relevant information from the text file. Most importantly, we should be able to get keywords from the resume that would be useful in a “keyword search” in a job portal. The Python module Rake, a “domain-independent keyword extractor,” doesn’t do a great job getting relevant words. For one, it doesn’t distinguish between standard ASCII characters and the special characters in the PDF. For example, a crude implementation returns “2015 interests snowboarding • triathlons • piano • backpacking • ultimate frisbee” as the most relevant phrase in my resume. Don’t think that’ll work.
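One possible way around the bullet problem is to split the extracted text on bullet-like symbols before handing it to a keyword extractor, so that each bulleted item becomes its own phrase. The sketch below is an assumption about how that could look, not what we ran during the hackathon. The function name and the set of separator characters are made up for the example.

```python
import re
import unicodedata

def split_resume_phrases(text):
    """Split raw resume text into candidate phrases on bullets and newlines."""
    # NFKC folds compatibility characters (e.g. the "fi" ligature) to plain letters
    text = unicodedata.normalize("NFKC", text)
    # Treat bullet-like symbols, pipes, and line breaks as phrase separators
    parts = re.split(r"[•·▪●|\n]+", text)
    phrases = []
    for part in parts:
        # Drop any remaining non-ASCII characters and collapse whitespace
        ascii_only = part.encode("ascii", "ignore").decode("ascii")
        cleaned = " ".join(ascii_only.split())
        if cleaned:
            phrases.append(cleaned)
    return phrases

raw = "2015 interests snowboarding • triathlons • piano • backpacking • ultimate frisbee"
print(split_resume_phrases(raw))
```

Each resulting phrase can then be scored separately, which avoids a whole bulleted line being treated as one long “keyword”.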

05.08 The main problem that we are encountering is unstructured, non-standard data from the resume.

06.54 I’m running out of steam. We got the keyword search working; we just have to prettify everything and come up with a sorting algorithm to really rank the results.

I noticed the birds are chirping, so I walked outside to get a little closer. The sun’s almost out. I think I might use this early morning opportunity to see the sunrise.

The hackathon venue is right next to the tower block, so I decided to watch the sunrise from up top. I swiped in with my student card for the general area, and then headed for an elevator to take me up the 25 floors. I put my key card in front of the access point and tried to push one of the floors, but nothing. Denied. I guess I will have to work for the sunrise, climbing the stairs up to the top. About halfway up, I stopped to make SURE the elevator didn’t work. I stepped inside and started pressing the buttons for the top floors, but they didn’t stay lit up. So I leaned down and started pushing buttons, somewhat hoping the machine would malfunction and I’d catch a ride up. Then I noticed the doors were closed, and the “open door” button wouldn’t work without a valid key card. So yaaaaa I’m currently trapped in the elevator debating what I can do.

Update: I called the Mysterious Elevator Man From Above with the push of a button and he said ground floor was my only option so I’m back to squ… floor one. Debating if I’ll make it up.

Update: I made it. Nice view but not much of a sunrise – I can only see out the west end of the tower.

09.00 Took a wee nap. It was nice but I don’t feel too great. Aadit and Pankaj are working furiously on the UI of the web app, trying to display the info nicely with React.

12.00 We got everything looking semi-presentable and have practiced our project pitch to a few other groups. Now we wait to present to the judges.