Writing an Instagram crawler
This note is about a small Instagram crawler. I used it to download photos and their metadata for a project visualizing photos from my road trip (see the Arizona Trip 2016 page).
Introduction
Instagram doesn’t provide an API to access your own photos, so we have to download the HTML pages and parse them manually.
The data we need to obtain about each photo:
- id
- display image
- thumbnail image
- location (lat + lon)
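To make the goal concrete, here is the record we’re aiming to build per photo. The Photo namedtuple and its field names are my own shorthand for this note, not anything Instagram provides:
from collections import namedtuple

# My own shorthand schema: one record per photo, matching the list above.
Photo = namedtuple('Photo',
    ['photo_id', 'display_src', 'thumbnail_src', 'lat', 'lon'])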
Let’s write some code to achieve the goal.
1. Get HTML data from the profile page
Okay, we have an Instagram profile page open in the browser; how do we get the data it’s showing us?
The first thing I tried was the View Page Source menu option.
There was a nice JSON blob with information about the photos:
<script type="text/javascript">window._sharedData = ...
</script>
But it only contained data for the first 12 photos, so it wasn’t useful for getting all of them.
The actual data is hidden in the react-root span, but it’s empty in the page source:
<body class="">
<span id="react-root"></span>
What we can do instead is Inspect the actual DOM in the browser debugger and then do Copy > Copy outerHTML on the <body> tag.
This could probably be automated further, but it’s really a one-off operation that doesn’t take much time.
NOTE: this method will give only the photos that are currently rendered, so we need to scroll until all desired photos are shown.
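If you do want to automate this step too, a browser driven by selenium could load the profile, scroll it, and dump the DOM. This is an untested sketch: the profile URL and scroll count are assumptions, and it needs chromedriver installed.
import io
import time
from selenium import webdriver

driver = webdriver.Chrome()  # requires chromedriver on PATH
driver.get('http://instagram.com/pankdm')
for _ in range(20):  # arbitrary count; scroll until all desired photos load
    driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
    time.sleep(1)  # give the page time to fetch the next batch
with io.open(HTML_FILE, 'w', encoding='utf-8') as f:  # same file we parse below
    f.write(driver.page_source)
driver.quit()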
2. Extract photo ids
Now we have a giant HTML dump from which we somehow need to extract the information relevant to us. I found an XPath-based approach to be pretty nice, and of course Python already has a library for that.
Let’s do Inspect > Copy > Copy XPath on two of the photos:
//*[@id="react-root"]/section/main/article/div/div[1]/div[49]/a[1]
//*[@id="react-root"]/section/main/article/div/div[1]/div[50]/a[2]
The difference is only in the last two parts: basically, div[X] specifies row X, while a[Y] specifies column Y.
Then we can run an XPath query on the downloaded HTML:
import lxml.html

s = open(HTML_FILE).read()
tree = lxml.html.fromstring(s)
selector = \
    '//*[@id="react-root"]/section/main/article/div/div[1]/div[*]/a[*]'
elems = tree.xpath(selector)
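As a quick sanity check that the wildcard selector matched what we expect:
print(len(elems))  # should equal the number of photos visible in the dump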
This will give us a list of all photos, which we can iterate over to extract the photo_id and thumbnail URL. Instances of the class lxml.html.HtmlElement have a field attrib containing all the attributes of the HTML tag. Let’s use that to get the data:
# elem == <a class="_8mlbc _vbtk2 _t5r8b" href="/p/BNiq9hsBXo3/?taken-by=pankdm">
tag = elem.attrib['href']
# tag == '/p/BNiq9hsBXo3/?taken-by=pankdm'
photo_id = tag.split('/')[2]  # photo_id == 'BNiq9hsBXo3'
img_html = elem.xpath('div/div[1]/img')
thumbnail_src = img_html[0].attrib['src']
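Putting the last two snippets together into a single pass over the query results (a sketch using only the fields shown above):
photos = []
for elem in elems:
    tag = elem.attrib['href']
    photo_id = tag.split('/')[2]
    img_html = elem.xpath('div/div[1]/img')
    thumbnail_src = img_html[0].attrib['src']
    photos.append((photo_id, thumbnail_src))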
3. Extract location
The next thing we need to do is obtain the geo coordinates of the photos:
photo_id -> location_id -> (lat, lon)
We need to download the HTML programmatically; the Python library requests is a good fit:
import requests
r = requests.get('http://instagram.com/p/{}'.format(photo_id))
print(r.text)
For such pages, the window._sharedData object in the embedded JavaScript contains enough data to extract what we want, so we can parse it out with a regular expression:
import re
import json

def fetch_json(url):
    r = requests.get(url)
    match = re.search('window._sharedData = (.*);</script>', r.text)
    return json.loads(match.group(1))
Putting this all together:
d = fetch_json('http://instagram.com/p/{}'.format(photo_id))
node = d['entry_data']['PostPage'][0]['media']
location_id = node['location']['id']
display_src = node['display_src']
d = fetch_json('http://instagram.com/explore/locations/{}'.format(location_id))
node = d['entry_data']['LocationsPage'][0]['location']
lat = node['lat']
lon = node['lon']
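Note that not every photo has a location tag; if one doesn’t, I’d expect node['location'] to be missing or null. A small wrapper guards against that (a sketch assuming the JSON layout above; get_coordinates is my own helper name):
def get_coordinates(photo_id):
    d = fetch_json('http://instagram.com/p/{}'.format(photo_id))
    node = d['entry_data']['PostPage'][0]['media']
    location = node.get('location')
    if not location:  # photo without a location tag
        return node['display_src'], None, None
    d = fetch_json(
        'http://instagram.com/explore/locations/{}'.format(location['id']))
    loc = d['entry_data']['LocationsPage'][0]['location']
    return node['display_src'], loc['lat'], loc['lon']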
4. Download photos
The only thing left is to download the actual photo files (both thumbnail and full size).
We can use the requests library to get the job done:
def download_photo(url, path):
    r = requests.get(url)
    with open(path, 'wb') as f:
        f.write(r.content)

download_photo(display_src, 'data/{}.jpg'.format(photo_id))
download_photo(thumbnail_src, 'data/th_{}.jpg'.format(photo_id))
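For completeness, here is how the pieces fit into one crawl over the profile dump, using the get_coordinates helper sketched above (error handling and politeness delays between requests omitted):
for elem in elems:
    photo_id = elem.attrib['href'].split('/')[2]
    thumbnail_src = elem.xpath('div/div[1]/img')[0].attrib['src']
    display_src, lat, lon = get_coordinates(photo_id)
    download_photo(display_src, 'data/{}.jpg'.format(photo_id))
    download_photo(thumbnail_src, 'data/th_{}.jpg'.format(photo_id))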