switch instagram to scraping due to new permission policy :( #603

Closed
snarfed opened this Issue Jan 15, 2016 · 31 comments

5 participants

@snarfed
Owner

Instagram is locking down their API and requiring all apps to go through a review process similar to facebook's. details in snarfed/granary#65.

they're mainly locking down /users/self/feed, /media/popular, and sending photos outside of instagram, none of which bridgy does, so i think we'll be ok, but no guarantees.

TODO for switching to scraping:

  • poll
  • mf2 handlers. (added scraping support to get_comment, get_like, etc.)
  • signup. started in the scrape_instagram branch.
    • if their account is protected, complain and don't finish signup.
  • cron job that updates profile pictures.
  • figure out backward compatibility for existing accounts. poll/propagate work ok, but...
    • data migration: remove the publish feature from all existing accts so that when they delete listen, their acct disappears correctly.
    • delete. indieauth into the first website we have for them. if they don't have any, make them add one and re-login first.
  • cache comment and like counts? like we already do for twitter and G+, so that we only fetch individual photo pages when we need to. should help keep our load lower and off IG's radar for a while.
  • handle /instagram/bret.io. he evidently changed his username from bret.io to uhhyeahbret, but we don't periodically refetch profiles (#304), so we didn't notice. plus username is the datastore entity key id, so it's tough to change. easiest answer will be to ask him to sign up again after we've ported signup. (done.)
  • delete. (was briefly blocked on aaronpk/IndieAuth.com#113.)
@snarfed snarfed added the now label Jan 15, 2016
@snarfed
Owner

the new set of oauth scopes aka permissions is on https://www.instagram.com/developer/authorization/ :

  • basic - to read a user’s profile info and media (granted by default)
  • public_content - to read any public profile info and media on a user’s behalf
  • follower_list - to read the list of followers and followed-by users
  • comments - to post and delete comments on a user’s behalf
  • relationships - to follow and unfollow accounts on a user’s behalf
  • likes - to like and unlike media on a user’s behalf
@snarfed
Owner

i started on the review process, but stopped when i saw it requires a screencast. ugh.

i'll do that eventually. here's the rest of what i have written up so far:

https://www.instagram.com/developer/clients/580be8883446443d8216ebdf0462f3b8/review/

1. Description

Got a blog? Do you post your public Instagram photos on your blog? Bridgy notifies your blog posts when people like or comment on your photos on Instagram.

2. How does your app use the Instagram API?

Bridgy helps individual users share their own content with their own web sites. Specifically, when a user posts a photo on their own web site (by any means) as well as Instagram, Bridgy notifies their web site when people like that photo or comment on it inside Instagram. This requires the basic permission.

Bridgy only operates on public accounts. It does not support private accounts.

Bridgy also has a publish feature that integrates with users' web sites in the other direction. Users can post on their web site that they like an Instagram photo, or have a comment on it, and they can then use Bridgy to post that comment or like that photo inside Instagram. These require the likes permission, which Bridgy currently has, and the comments permission, which it doesn't.

3. Do you need additional permissions?

Permission: likes

Users can post on their web site that they like an Instagram photo. They can then use Bridgy to like that photo inside Instagram.

Permission: comments

Users can post on their web site a comment on an Instagram photo. They can then use Bridgy to post that comment on that photo inside Instagram.

@snarfed
Owner

made the screencast: https://youtu.be/eGMNItivBdY

@snarfed
Owner

...and submitted to instagram for approval. fingers crossed! https://www.instagram.com/developer/clients/580be8883446443d8216ebdf0462f3b8/edit/#permissions

@snarfed
Owner

they denied us. :(

Invalid Use Case: The use case described in your submission notes, screencast and website is not a valid use case that we allow on our Platform. Please see our Permissions Review and valid use cases description (https://www.instagram.com/developer/review/) for more information.

well. that's a problem.

they also denied commenting and liking, which is a bit less surprising, and due to a technicality: we didn't describe our use case well. meh.

comments:

This permission (comments) does not support the use case you described in your submission notes, screencast and website. Please review Login Permissions (http://instagram.com/developer/authorization) for a comprehensive list of permissions and valid use cases.
likes:

This permission (likes) does not support the use case you described in your submission notes, screencast and website. Please review Login Permissions (http://instagram.com/developer/authorization/) for a comprehensive list of permissions and valid use cases.

@snarfed
Owner

next step: apply for oauth-dropins and see if i can get it approved. not holding my breath, but i'd like to find at least one app i can get approved, just to see how the process works all the way through.

@snarfed
Owner

done. fingers crossed!

@snarfed
Owner

oauth-dropins got rejected too. :/

Still in Development: Your app is still in development. Please resubmit only when your app is ready to go live and no longer in development.
Invalid Use Case: The use case described in your submission notes, screencast and website is not a valid use case that we allow on our Platform.

@snarfed
Owner

i'm running out of ideas. i may have to start scraping. :/

@snarfed snarfed changed the title from "submit to instagram's new app review/sandbox process" to "handle instagram's new app permissioning" Feb 1, 2016
@snarfed snarfed changed the title from "handle instagram's new app permissioning" to "handle instagram's new permission policy" Feb 1, 2016
@snarfed
Owner
snarfed commented Feb 5, 2016 edited

i took a brief look at what it would take to switch to scraping. the good news is, it's doable. instagram profile and photo pages happily serve without being logged in, and the data is easily available in JSON that we already have code to extract and parse.

the bad news is, profile pages only include counts of comments and likes for each photo, not the actual data about them. we'd have to fetch the individual photo pages to get the data. annoying, but not too bad. we already do this for twitter and google+.

the more worrisome part is that comments and likes are paged, so fetching the photo only gets us the first 10 of each. hrmph. if it's the most recent 10, we'll be able to backfeed at least 10 comments and likes per photo per poll period (20m right now)...but i expect some people peak above that sometimes. hrmph.
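a minimal sketch of the "JSON embedded in the HTML" part, not the actual granary code. it assumes the page assigns its data to a `window._sharedData` variable inside a script tag; that variable name, the regex, and the keys in the sample below are all illustrative assumptions.

```python
import json
import re

# assumption: the page embeds its data as `window._sharedData = {...};`
# inside a <script> tag. the real markup may differ.
SHARED_DATA_RE = re.compile(
    r'window\._sharedData\s*=\s*(\{.*?\});\s*</script>', re.DOTALL)


def extract_embedded_json(html):
    """Return the embedded JSON object as a dict, or None if it isn't found."""
    match = SHARED_DATA_RE.search(html)
    if not match:
        return None
    return json.loads(match.group(1))


# made-up sample page; the key names are hypothetical
sample = (
    '<html><script type="text/javascript">'
    'window._sharedData = {"entry_data": {"PostPage": '
    '[{"media": {"likes": {"count": 10}}}]}};'
    '</script></html>'
)
data = extract_embedded_json(sample)
```

the same helper would work for both profile pages and individual photo pages, since only the contents of the JSON differ.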

@kylewm
Collaborator

Iiiiiii'd give some serious thought to whether it's worth the effort. Because of aaronpk/OwnYourGram#16 PESOS doesn't work for many people any more anyway, and it's very likely OYG will be cut off altogether (even if he rebrands it).

I'm curious what the situation with IFTTT/Zapier/etc. integration is... whether their channels will be shut off too.

@snarfed
Owner

hrmph, true. point taken.

i still posse to IG manually, so i may still do it if only for myself. we'll see.

@kylewm
Collaborator

well, if you do do it, I'll certainly continue to use it :P

@snarfed snarfed self-assigned this Feb 21, 2016
@snarfed snarfed changed the title from "handle instagram's new permission policy" to "switch instagram to scraping due to new permission policy :(" Feb 21, 2016
@snarfed snarfed added a commit to snarfed/granary that referenced this issue Feb 21, 2016
@snarfed instagram scraping: implement fetch_extras e8f1eff
@snarfed
Owner

ok, this is implemented, naively. it has to do an HTTP fetch per picture, in serial, to get comments and likes. ideally, those fetches would be parallelized, and we'd also cache and check the counts, like we do for G+ now (and i think twitter), so it only does the fetches when there are new comments or likes.
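a rough sketch of those two improvements, under assumptions: `photos_to_fetch`, `fetch_all`, and the `id`/`comments`/`likes` keys are hypothetical names, and `fetch_photo` stands in for whatever function fetches and parses one photo page.

```python
from concurrent.futures import ThreadPoolExecutor


def photos_to_fetch(cached_counts, profile_photos):
    """Return ids of photos whose comment or like count differs from the cache.

    cached_counts maps photo id -> (comment count, like count) from the last
    poll. profile_photos is a list of dicts scraped from the profile page,
    with hypothetical keys 'id', 'comments', and 'likes'.
    """
    return [p['id'] for p in profile_photos
            if cached_counts.get(p['id']) != (p['comments'], p['likes'])]


def fetch_all(photo_ids, fetch_photo, max_workers=4):
    """Fetch photo pages in parallel threads. Returns {photo id: result}."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(photo_ids, pool.map(fetch_photo, photo_ids)))
```

the cache is deliberately not updated inside `photos_to_fetch`, so a caller can store the new counts only after the fetch for that photo succeeds.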

@snarfed snarfed added listen now and removed now labels Feb 21, 2016
@snarfed snarfed removed their assignment Feb 23, 2016
@snarfed
Owner

heh. good question! fortunately we'd scrape your profile page, not your feed, and profile pages probably won't be algorithmic.

@snarfed snarfed added a commit to snarfed/granary that referenced this issue Mar 23, 2016
@snarfed instagram scraping: use ID_USERID format for media ids 61b24a2
@snarfed
Owner

open question: how to do auth for Instagram users, ie prove that they own an account before signing up or deleting, without the API?

the only answers I've come up with so far are 1) no auth and 2) indieauth, and check that the same domain is in the Instagram profile...

...in which case we'd need snarfed/oauth-dropins#10 (indieauth support).
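a sketch of what option 2's check could look like, assuming we already have the URL the user proved via indieauth and the website scraped from their Instagram profile. `domain_of` and `owns_account` are hypothetical helpers, and ignoring a leading `www.` is my assumption, not a decision made in this thread.

```python
from urllib.parse import urlparse


def domain_of(url):
    """Return the lowercased host of a URL, without a leading 'www.'."""
    url = url or ''
    # profile fields often omit the scheme, so add '//' to help urlparse
    host = urlparse(url if '//' in url else '//' + url).hostname or ''
    host = host.lower()
    return host[4:] if host.startswith('www.') else host


def owns_account(indieauth_url, profile_website):
    """True if the indieauth'ed domain matches the profile's website domain."""
    me = domain_of(indieauth_url)
    return bool(me) and me == domain_of(profile_website)
```

for example, `owns_account('https://snarfed.org/', 'http://www.snarfed.org')` passes, while a profile with no website never does.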

@snarfed
Owner

we'll also need to port the cron job that updates profile pictures.

@snarfed
Owner

starting a todo list in the description.

@snarfed snarfed added a commit to snarfed/granary that referenced this issue Mar 28, 2016
@snarfed switch instagram to scraping. *cry* d48826b
@snarfed
Owner

looks like the mf2 handlers were ok after all. the ID_USERID is expected, happens now, and works ok. i got a 200 from https://api.instagram.com/v1/media/1209758400153852506_1103525 just now, which 404ed earlier. so maybe a transient instagram problem? seems unlikely, but possible.

i still have to port the mf2 handlers themselves from the api to scraping, but that's separate.

@snarfed
Owner

wow. evidently the real problem is that the API returns incomplete data. eg https://api.instagram.com/v1/media/1209758400153852506_1103525 says there are 10 likes but only includes 4 of them. the embedded JSON data in the HTML, https://www.instagram.com/p/BDJ7Nr5Nxpa/ , includes all 10.

pretty clear. on to porting the handlers!
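for reference, a sketch of how the API media id maps to the shortcode in the photo page URL. it assumes the shortcode is the numeric id written in base 64 with the URL-safe alphabet; that holds for the example above, but this is an illustration, not necessarily how the ported handlers do it.

```python
import string

# URL-safe base64 alphabet: A-Z, a-z, 0-9, '-', '_'
ALPHABET = (string.ascii_uppercase + string.ascii_lowercase +
            string.digits + '-_')


def id_to_shortcode(media_id):
    """Convert a media id (int or str, optionally ID_USERID) to a shortcode."""
    if isinstance(media_id, str):
        # drop the _USERID suffix if present
        media_id = int(media_id.split('_')[0])
    chars = []
    while media_id > 0:
        media_id, remainder = divmod(media_id, 64)
        chars.append(ALPHABET[remainder])
    return ''.join(reversed(chars)) or ALPHABET[0]


shortcode = id_to_shortcode('1209758400153852506_1103525')
print('https://www.instagram.com/p/%s/' % shortcode)
```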

@snarfed snarfed added a commit to snarfed/granary that referenced this issue Mar 30, 2016
@snarfed add Instagram.id_to_shortcode. for snarfed/bridgy#603 6f312e3
@snarfed snarfed added a commit to snarfed/granary that referenced this issue Mar 31, 2016
@snarfed instagram scraping: port get_comment() to scrape 4f0993a
@snarfed snarfed added a commit to snarfed/granary that referenced this issue Mar 31, 2016
@snarfed instagram scraping: test get_like(). for snarfed/bridgy#603 b967d78
@snarfed snarfed added a commit to snarfed/granary that referenced this issue Apr 1, 2016
@snarfed instagram scraping: add get_actor() support 98e3268
@snarfed snarfed added a commit to snarfed/granary that referenced this issue Apr 1, 2016
@snarfed instagram scraping: html profile to actor: mark private accounts
...with actor['to'] == [{'objectType':'group', 'alias':'@private'}]

snarfed/bridgy#628, snarfed/bridgy#603
8884499
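a small sketch of how signup could use that marker to refuse protected accounts, per the TODO list. `is_private` and `check_signup` are hypothetical names, and the `username` key in the error message is an assumption about the actor object.

```python
# the audience value that marks a private account, from the commit above
PRIVATE_AUDIENCE = {'objectType': 'group', 'alias': '@private'}


def is_private(actor):
    """True if the actor object is marked as a private account."""
    return PRIVATE_AUDIENCE in actor.get('to', [])


def check_signup(actor):
    """Raise ValueError instead of finishing signup for a private account."""
    if is_private(actor):
        raise ValueError(
            'account %s is private; make it public and try again'
            % actor.get('username', '(unknown)'))
    return actor
```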
@snarfed
Owner

delete is blocked on aaronpk/IndieAuth.com#113.

@snarfed
Owner

current plan for deleting legacy API accounts is that we'll indieauth into their first web site in domain_urls, which means delete won't work for accounts without any web sites. they'll need to re-login (with indieauth) first. here are those accounts:

/instagram/adamdohm
/instagram/amohd2
/instagram/andresin87
/instagram/chellebb
/instagram/debbite
/instagram/dougmckown
/instagram/eddy.arnold
/instagram/espylaub
/instagram/fck_yeah_
/instagram/fermentationfan
/instagram/hendryque
/instagram/isapien
/instagram/jamieontiveros
/instagram/johnbenson
/instagram/mathewi
/instagram/mistermaumau
/instagram/nikolnieto
/instagram/njashanmal
/instagram/photofox
/instagram/realkoyuchan
/instagram/silveradepy
/instagram/srevo
/instagram/the_timweston
/instagram/tylergillies
/instagram/zlojkashtan
@snarfed
Owner

ran this in remote_api_shell to remove publish from all instagram accounts:

# drop the 'publish' feature from every Instagram account that has it
for i in Instagram.query(Instagram.features == 'publish'):
  i.features.remove('publish')
  i.put()
@snarfed
Owner

flipped the switch! all instagram accounts are now on scraping and using indieauth for login/delete. fingers crossed!

@snarfed
Owner

looking good so far. tentatively closing. woo!

@snarfed snarfed closed this Apr 4, 2016
@Johnathangalliano

@snarfed May I ask how you passed the "Still in Development: Your app is still in development. Please resubmit only when your app is ready to go live and no longer in development." part? I am trying to submit my app now and I get this error back. And I can't for the life of me understand what it means. Sorry to hijack your thread but you seem to be the only one that has faced this issue.

@snarfed
Owner

@Johnathangalliano sounds like your app is still in sandbox mode? https://www.instagram.com/developer/sandbox/

i didn't actually get approved, so i don't have more specific advice, sorry. i switched to scraping their html instead. :/

@rummykhan

i was also scraping, and it was all going very well, but suddenly all my accounts started getting limit exceeded, even when i sign in. do you have a fix for this? and did you monitor the rate limit on different endpoints?
thanks

@snarfed
Owner

@rummykhan if you're getting 429s, then yeah, instagram rate limits HTTP requests by IP address or subnet. i hit that at one point too. lots of details in #665 and https://groups.google.com/d/msg/google-appengine/rpendSIxJMo/_u4G6uXiBQAJ .

@rummykhan
rummykhan commented Aug 15, 2016 edited

thanks @snarfed and yea i was getting response code 429, today i did some testing and what i found is here..
maybe it'll help somebody.

Instagram Scraping WORKAROUND

Tests:

2.  Get Posts of a user


    Test # 1 (Instagram Form Auth - Account 1)
    ------------------------------
        Login Status = Success

        Minutes     = 2:42
        Seconds     = 162
        Requests    = 354

        After this got response code 429 (Limit Exceeded)


    Test # 2 (Instagram Form Auth - Account 2)
    ------------------------------
        Login Status = Success

        Minutes     = 3:11
        Seconds     = 191
        Requests    = 400

        After this got response code 429 (Limit Exceeded)


    Test # 3 (Instagram Form Auth - Account 3)
    ------------------------------
        Login Status = Fail (Asked for email/phone verification)

        Minutes     = 3:13
        Seconds     = 182
        Requests    = 393

        After this got response code 429 (Limit Exceeded)

        Observation
        -----------
        1. We can get the user posts without being logged in.


    Test # 4 (No Auth - Time Delay 1 Second)
    ------------------------------
        Minutes     = 173
        Seconds     = 10438
        Requests    = 7051

        State: Stopped intentionally

Key Observation

  1. Request counts are IP based (previously i thought they were user based).
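for comparison, here are the request rates implied by the numbers above, using the Seconds and Requests values as written. tests 1 to 3 hit the 429 at roughly 2 requests per second, while test 4 stayed under 1 request per second and was stopped intentionally. the labels are just shorthand for the test headings.

```python
# (label, requests made, seconds elapsed), copied from the tests above
TESTS = [
    ('test 1, logged in', 354, 162),
    ('test 2, logged in', 400, 191),
    ('test 3, login failed', 393, 182),
    ('test 4, no auth, 1s delay', 7051, 10438),
]


def requests_per_second(requests, seconds):
    """Average request rate over the whole test run."""
    return requests / seconds


rates = {label: round(requests_per_second(n, s), 2) for label, n, s in TESTS}
for label, rate in rates.items():
    print('%s: %s requests/second' % (label, rate))
```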

Solution


  1. Use proxies to avoid rate limiting. (Change the proxy when you receive a 429.)
  2. To improve speed, use Python multiprocessing with proxy chaining.
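a different option from the proxy approach above is to simply slow down when a 429 comes back. this is a minimal backoff sketch, not something tested against instagram; `fetch` is a hypothetical callable that returns a `(status, body)` tuple, and the delays are arbitrary.

```python
import time


def fetch_with_backoff(fetch, url, max_tries=4, base_delay=1.0,
                       sleep=time.sleep):
    """Call fetch(url), waiting and retrying while it returns HTTP 429.

    The wait doubles after each 429. Returns the last (status, body), which
    may still be a 429 if every try was rate limited.
    """
    delay = base_delay
    for attempt in range(max_tries):
        status, body = fetch(url)
        if status != 429:
            return status, body
        if attempt < max_tries - 1:
            sleep(delay)
            delay *= 2
    return status, body
```

passing `sleep` as a parameter keeps the function easy to test without actually waiting.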