Exporting Data From Anime-Planet

NOTE: This was originally posted as an AniList forum thread, almost exactly two years ago. I recently decided to re-publish it here after seeing a tweet from Anime-Planet announcing their intention to implement data export - and a public API - in the near future.

Larry Wall once said that there are three attributes of a great programmer:

Laziness
Hubris
Something else, which I didn't look up because of point 1.

As a ~~lazy~~ great programmer, I am paradoxically willing to go to extreme lengths to avoid any form of unnecessary work. Sometimes these lengths actually involve more effort than would have been expended in actually doing the thing I am avoiding, but this is who I am don't you try to change me dad.

Several years ago, I signed up for Anime-Planet so that I could keep track of the Japanese Schoolgirl Robot Cartoons I waste my mortality on. Eventually though, I decided to move on, due to a confluence of the activity feeds being ~~broken~~ under maintenance, video advertisements for the Warcraft movie fading in over the actual page content, and the infamous "ratings shamer" image used to "shame" users (me) who didn't leave ratings for things they watched, despite the fact that doing so is largely pointless since nothing really matters and we're all going to die eventually.

I tried signing up for MyAnimeList once; the sign-up process somehow went wrong, leaving me in a quantum superposition wherein I was both registered and unregistered at the same time, unable to log in yet equally unable to create a new account. Emails sent to their support received no response, so I gave up and went back to Anime-Planet.

This time, I decided to try AniList instead. Thanks to my years of training, I was able to conquer the registration page in less than six decades - a definite improvement. Now I just needed a way to transfer the lists from my Anime-Planet account over to my newly-minted AniList profile.

Step 1: Assuming that we live in a universe in which convenience exists.

AniList has a list import feature. This is a good and noble thing.

However, Anime-Planet does not have a corresponding export feature. I stared out past the rivulets of raindrops streaming down the window; why, I pondered, do we live in a world where such injustice is not just tolerated, but commonplace?

Maybe there's a public API? AniList has one. I could easily convert JSON data into the right format if I coulNOPE, there isn't, IDEA ABANDONED.

Since a simple export-and-import approach was clearly off the cards, I went off in search of others who had faced similar circumstances, in hopes of finding an alternative solution.

Step B: Running random code you found on the internet is always a good idea.

Before continuing, I should probably explain how the import/export process works in more detail.

According to The Internet, MyAnimeList allows users to import or export lists in a custom XML-based format for backup purposes. Other sites (such as Anime-Planet and AniList) can import these XML files, allowing users to easily transfer their details elsewhere without needing to re-enter everything manually.

If I could find a way to convert Anime-Planet lists into MAL-style XML, I could import them in the same way. But is such a thing even possible?

Yes, it is. At some point, ~~aliens~~ someone wrote a Python 2 script and posted it on Pastebin. I downloaded it, and ran it in a terminal window:

This script will export your anime-planet.com anime list and saves it to anime-planet.xml
Enter your username: GoBusto
Traceback (most recent call last):
  File "./anime-planet.com_xml_anime_exporter.py", line 13, in <module>
    html = urllib2.urlopen(baseURL).read()
  File "/usr/lib/python2.7/urllib2.py", line 154, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python2.7/urllib2.py", line 435, in open
    response = meth(req, response)
  File "/usr/lib/python2.7/urllib2.py", line 548, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python2.7/urllib2.py", line 473, in error
    return self._call_chain(*args)
  File "/usr/lib/python2.7/urllib2.py", line 407, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 556, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 403: Forbidden

Oh. It seems that this script doesn't work any more.

Aha! There's an updated Python 3 version - let's try that instea-

This script will export your anime-planet.com anime list and saves it to anime-planet.xml
Enter your username: GoBusto
Traceback (most recent call last):
  File "./animeplanet_to_mal_exporter.py", line 17, in <module>
    html = urllib.request.urlopen(baseURL).read()
  File "/usr/lib/python3.5/urllib/request.py", line 163, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.5/urllib/request.py", line 472, in open
    response = meth(req, response)
  File "/usr/lib/python3.5/urllib/request.py", line 582, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python3.5/urllib/request.py", line 510, in error
    return self._call_chain(*args)
  File "/usr/lib/python3.5/urllib/request.py", line 444, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.5/urllib/request.py", line 590, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden

okay.jpg

Why was neither of these scripts working? Was there some common factor that caused both to fail in the same way?

Then it hit me: They were both written in Python.

In light of this undeniable evidence that it was very clearly Python causing the failures, I decided to adapt one of my old Ruby scripts to do the job instead.

Step Green: Expecting tried-and-tested code to continue working in the future.

Several years ago, I wrote an IRC bot in Ruby as a way of getting familiar with the language. One of the things it could do (via a "plug-in" script) was search Anime-Planet for information about anime or manga when asked to do so by another IRC user. This was my starting point.

Before attempting to modify the script, I first wanted to ensure that it still worked properly:

irb(main):001:0> require './AnimePlanet.rb'
=> true
irb(main):002:0> AnimePlanet.new.run "anime Rozen Maiden"
=> "Sorry, I couldn't find any information about Rozen Maiden"

Spoiler alert: It didn't.

My guess is that the Cloudflare DDoS protection used by Anime-Planet stops simple scripts from connecting to it and downloading the HTML content. That would explain the 403 errors from both Python scripts, and it would make any attempt to automatically generate an XML file from the contents of the page impossible.
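One way to poke at that guess is to catch the error and look at the status code, rather than letting the traceback end the script. The following is only a sketch: `check_access` is a made-up helper name, and the `opener` parameter exists purely so the function can be tried out without touching the network. A 403 on its own doesn't prove that DDoS protection is responsible; it just confirms that the server refused the request.

```python
import urllib.error
import urllib.request


def check_access(url, opener=urllib.request.urlopen):
    """Return (status_code, verdict) for a URL instead of raising on HTTP errors.

    `opener` defaults to urllib's own urlopen. It is a parameter only so that
    a fake can be swapped in for testing (an assumption of this sketch).
    """
    try:
        with opener(url) as response:
            return response.status, "ok"
    except urllib.error.HTTPError as error:
        # 403 is the code both Python scripts above ran into.
        verdict = "blocked" if error.code == 403 else "http error"
        return error.code, verdict
```

Calling `check_access` with a list page URL would report something like `(403, "blocked")` if the request is refused in the same way the two scripts were.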

Step Blarple: Nothing is impossible (don't let your dreams be dreams).

At this point, I gave up; I started the long, boring process of manually copying everything from one browser window into another. I continued for THREE HOURS until I realised that it was 11pm and I had to wake up at 7am for work. Since I need at least 26 hours of sleep per day, I reluctantly turned off my computer and went to bed, angry at the world and all of the things in it (but mostly XML files which could not be auto-generated due to DDoS protection services getting in the way of my valiant attempts to avoid unpaid data entry work).

Just as I was drifting off, a little voice in the back of my head piped up: "Why don't you use Selenium?"

For those of you who have never heard of Selenium, here's a brief introduction: Once upon a time, some dude made a thing called Selenium Remote Control so that he could automatically find errors on the websites he created. This was useful, since it meant that instead of having to manually look for coding errors on his website, he could make coding errors in the scripts used to control Selenium RC instead, thus shifting the blame elsewhere and achieving what is known by us in the industry as Progress.

This AutoIt-but-just-for-web-browsers was later superseded by Selenium WebDriver, which properly integrated with Firefox, Chrome, etc. and allowed them to be controlled by a computer program pretending to be a human being, ~~also known as John Carmack~~. Since the requests come from a real browser, most websites can't easily tell the difference between a browser controlled by Selenium and a browser used by an actual human made of flesh and capillaries and such.

The next day, I came home from work and spent the evening writing a small Ruby script to control Firefox via Selenium WebDriver. The result: A program which asks for a username, and then auto-pilots a browser through each page of that user's anime and manga lists, recording the things it finds in JSON format. Success!
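The real script is written in Ruby and lives at the link below. What follows is not that script, just a rough Python sketch of the same idea: visit each list page until one comes back empty, then dump everything as JSON. The URL pattern and the CSS selector are guesses that would need checking against the actual page markup, and `collect_list`, `selenium_fetcher` and `dump_json` are names invented for the example.

```python
import json


def collect_list(fetch_page, max_pages=500):
    """Call fetch_page(1), fetch_page(2), ... until a page has no entries."""
    entries = []
    for page_number in range(1, max_pages + 1):
        page_entries = fetch_page(page_number)
        if not page_entries:
            break
        entries.extend(page_entries)
    return entries


def selenium_fetcher(driver, username):
    """Build a fetch_page function that drives a real browser.

    ASSUMPTIONS: the URL layout and the "li.card" selector are placeholders,
    not taken from the original script or verified against Anime-Planet.
    """
    # Imported lazily so the rest of the sketch works without Selenium installed.
    from selenium.webdriver.common.by import By

    def fetch_page(page_number):
        driver.get(
            f"https://www.anime-planet.com/users/{username}/anime?page={page_number}"
        )
        cards = driver.find_elements(By.CSS_SELECTOR, "li.card")
        return [{"title": card.text.strip()} for card in cards]

    return fetch_page


def dump_json(entries):
    """Serialise the collected entries, keeping non-ASCII titles readable."""
    return json.dumps(entries, indent=2, ensure_ascii=False)
```

With Selenium installed, something along the lines of `dump_json(collect_list(selenium_fetcher(webdriver.Firefox(), "GoBusto")))` would kick it off. Keeping the paging loop separate from the browser code means the loop can be tested with a fake `fetch_page` and no browser at all.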

The script can be found here: https://gitlab.com/gobusto/ap-list-download

The next thing to do would be to either use the JSON data to update AniList via the AniList API, or to simply have the Selenium script dump the data in XML format, ready for import via the AniList user interface.
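For the XML option, the conversion itself would be small. This is a sketch only: the element names follow what MAL-style export files commonly look like, but they have not been checked against what the AniList importer actually requires. The input keys (`mal_id`, `title`, `episodes`, `score`, `status`) are invented for the example rather than taken from the script's real JSON output.

```python
import xml.etree.ElementTree as ET


def entries_to_mal_xml(entries):
    """Turn a list of dicts into a MAL-style XML string.

    ASSUMPTION: the tag names below mirror typical MAL exports and the dict
    keys are hypothetical; neither comes from the original script.
    """
    root = ET.Element("myanimelist")
    for entry in entries:
        anime = ET.SubElement(root, "anime")
        ET.SubElement(anime, "series_animedb_id").text = str(entry["mal_id"])
        ET.SubElement(anime, "series_title").text = entry["title"]
        ET.SubElement(anime, "my_watched_episodes").text = str(entry["episodes"])
        ET.SubElement(anime, "my_score").text = str(entry["score"])
        ET.SubElement(anime, "my_status").text = entry["status"]
    return ET.tostring(root, encoding="unicode")
```

Note that the sketch assumes each entry already knows its MAL ID. A scraped list would presumably only have titles, so something would still have to work out which series each title refers to.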

A "standard" anime list format is an idea too beautiful for this sinful world.

Wouldn't it be nice if there were some standard file format that all anime/manga list sites used to import/export data? We would no longer live in a world where screen-scraping your own profile page via a semi-autonomous robo-browser servant is necessary to download your own history of cartoon-based amusement.

It would also be nice to live in a world where free kittens were allocated to every citizen, and you could eat pizza for every meal without doctors everywhere crying blood due to your poor (but undeniably delicious) life choices. What I'm saying is that such a wonderful dream can never be, due to the various factors that our harsh, unforgiving reality imposes on each and every one of us:

Everything has like a jillion different names.

How many ways can this name be given? Let us count, together:

Ooyasan wa Shishunki
Ooyasan ha Shishunki
Ooya-san wa Shishunki
Ooya-san ha Shishunki
Ouyasan wa Shishunki
Ouyasan ha Shishunki
Ouya-san wa Shishunki
Ouya-san ha Shishunki
Oyasan wa Shishunki
Oyasan ha Shishunki
Oya-san wa Shishunki
Oya-san ha Shishunki
Ōyasan wa Shishunki
Ōyasan ha Shishunki
Ōya-san wa Shishunki
Ōya-san ha Shishunki
おおやさんはししゅんき
大家さんは思春期

What about this?

Sometimes it's called Working!!
Other times it's called Wagnaria!!
WHICH IS IT, OBAMA?

I can guarantee that whichever one you pick, some other site will pick another option. Unless you're willing to allow every permutation of every possible name for every possible anime as part of your "standard" format, things are going to be somewhat complicated.
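To give a feel for how complicated, here is a sketch of the kind of fuzzy matching a converter might resort to. It is a lossy heuristic, not a proper romanisation library: `title_key` is a made-up name, and the rules (drop macrons and hyphens, squash "oo"/"ou" into "o", treat "ha" as "wa") are just the ones needed for the list above. They would happily merge titles that really are different.

```python
import re
import unicodedata


def title_key(title):
    """Reduce a romanised title to a rough comparison key (heuristic only)."""
    # Split accented letters into base letter + combining mark, then drop the marks.
    decomposed = unicodedata.normalize("NFKD", title)
    text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    text = text.lower().replace("-", "")
    # Treat the long-vowel spellings "oo" and "ou" as a plain "o".
    text = re.sub(r"o[ou]", "o", text)
    # The particle can be romanised as either "wa" or "ha".
    words = ["wa" if word == "ha" else word for word in text.split()]
    return " ".join(words)
```

All sixteen romanised spellings above collapse to the same key, `oyasan wa shishunki`. The kana and kanji versions pass through unchanged, so they still don't match anything, and no amount of string-mangling will ever connect Working!! to Wagnaria!!; that needs a lookup table maintained by someone with more patience.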

Every site has different rating systems.

In the case of AniList, even a single site has multiple rating systems:

?/10
?/10.0
?/100
?/5 stars
:), :|, and :(

How does the :)-iness of a rating on AniList translate to a percentage-based system, or even one where GRAPHICS/GAMEPLAY/SKUB are given as separate scores? Only smarties have the answer.
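For the sake of argument, here is what a converter might look like if somebody simply picked an answer. Everything about the smiley mapping is an arbitrary assumption of this sketch (the 35/60/85 values are not claimed to be what any site uses), and `to_percent` / `from_percent` are invented names.

```python
# ASSUMPTION: arbitrary positions for each smiley on a 0-100 scale.
SMILEY_TO_PERCENT = {":(": 35, ":|": 60, ":)": 85}

# Maximum value of each numeric system from the list above.
SCALE_MAX = {"10": 10, "10.0": 10.0, "100": 100, "5 stars": 5}


def to_percent(value, system):
    """Convert a rating in the given system to a whole-number percentage."""
    if system == "smiley":
        return SMILEY_TO_PERCENT[value]
    return round(value * 100 / SCALE_MAX[system])


def from_percent(percent, system):
    """Convert a percentage back into the given system, rounding as needed."""
    if system == "smiley":
        # Pick the smiley whose assumed percentage is closest.
        return min(
            SMILEY_TO_PERCENT,
            key=lambda face: abs(SMILEY_TO_PERCENT[face] - percent),
        )
    scaled = percent * SCALE_MAX[system] / 100
    return round(scaled, 1) if system == "10.0" else round(scaled)
```

The catch is that every conversion to a coarser scale throws information away. A score of 73/100 becomes 4 stars, and 4 stars converts back to 80/100, so a list that bounces between two sites comes back with different ratings than it left with.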

For manga, some sites use volumes/chapters, some use chapters only.

Anime-Planet allows either volumes or chapters to be used. AniList only allows chapters. How many chapters are there in a volume? lol idk ~~my bff jill?~~

OVA episodes - are they a "series", or are they independent?

Anime-Planet categorises the various Angel Beats animations as follows:

Angel Beats! (13 eps)
Angel Beats! Another Epilogue
Angel Beats! Hell's Kitchen
Angel Beats! Special

AniList, however, categorises them like this:

Angel Beats! (13 eps)
Angel Beats!: Another Epilogue
Angel Beats! Specials (2 eps)

Which is correct? ~~Neither, both options are equally valid~~ Obviously Anime-Planet is wrong, my loyalty to AniList is absolute and unwavering, all hail AniList.
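Whichever site is right, a converter has to carry a mapping between the two. A minimal sketch of that, assuming that the two stand-alone specials on Anime-Planet are the same two episodes that AniList groups under "Specials" (a guess from the lists above, not something either site confirms); `AP_TO_ANILIST` and `anilist_progress` are made-up names.

```python
from collections import Counter

# ASSUMPTION: this pairing is inferred from the two lists above.
AP_TO_ANILIST = {
    "Angel Beats!": "Angel Beats!",
    "Angel Beats! Another Epilogue": "Angel Beats!: Another Epilogue",
    "Angel Beats! Hell's Kitchen": "Angel Beats! Specials",
    "Angel Beats! Special": "Angel Beats! Specials",
}


def anilist_progress(ap_progress):
    """Sum episodes watched per AniList entry from per-entry Anime-Planet counts."""
    totals = Counter()
    for ap_title, episodes_watched in ap_progress.items():
        totals[AP_TO_ANILIST[ap_title]] += episodes_watched
    return dict(totals)
```

Going in this direction the two specials simply add up. Going the other way is worse: "1 of 2 specials watched" on AniList gives no hint as to which Anime-Planet entry should be ticked.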

tl;dr

Anime-Planet has no export button.
Doing things "the right way" doesn't work.
Selenium: "Not The Worst Solution"
Standards = Never.