The post Telling AI to go away (but politely) appeared first on dxw.
Telling AI to go away (but politely)

When it comes to AI – as with all new technologies starting to find their stride – defining the rules about how they should be used lags behind their adoption
One thing we’ve noticed over the past few months is a slow and steady increase in traffic to sites we host from services which most people would label as “AI”. This isn’t a blog post about the pros and cons of AI though; instead this is a post about telling those services – as politely as possible – to go away.
There are many reasons why we might want to do this. Content licensing might forbid re-use of data in certain ways, some content might have more serious implications if AI misinterprets it, or in some cases we might just plain disagree with people using whatever content they feel like to train their language models. Whatever the reason, sometimes we need to say “not today” to the robots.
Unfortunately, the world of AI doesn’t seem to have moved as far as getting AI to read the licence for content it’s crawling and figure out if it’s allowed to. There’s probably a research paper in that for someone, but that’s a job for another day. Instead we’re left with the question of “how can we programmatically tell an AI crawler how it’s allowed to use content on this page?”
What’s the problem?
The default stance of most services which grab content from the internet – search engines and AI crawlers, for example – is that things are fair game unless explicitly told otherwise. Whether this is right or wrong is a minefield of content rights, copyright law and philosophy, but over the years a set of standards have emerged for more mature technologies like search engines. These can be used to describe exactly what can and can’t be done with online content – can they look at a page in the first place for example, and if they do should they then be able to index it?
There’s also a bunch of well-established standards for providing hints to search engines about what’s on a page, how data is structured and how it relates to other pieces of information, and there are standard ways of embedding licence and ownership details in things like images. All of these can be (and mostly are) used by well behaved systems to make informed decisions about how to use content they find online.
When it comes to artificial intelligence – as with all new technologies starting to find their stride – defining the rules about how they should be used lags behind their adoption. We took a quick look into the current options.
What can we do about it?
We’ve been looking at ways we can steer crawlers for AI services into doing what we want them to when it comes to consuming content on sites we run. Here are the most interesting and useful bits of what we found:
Use robots.txt
If you’ve done any kind of search engine tinkering or SEO in the past you’re probably familiar with robots.txt. The full specification is a bit boring, but the short version is it tells web crawlers how they should behave when faced with a site. It also allows us to tell specific crawlers what they can and can’t do, which means we can instruct GPTBot, for example, that it’s not allowed to crawl anything.
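As a sketch, a robots.txt which singles out GPTBot while leaving everything else alone might look like this (crawler names change over time, so the vendor's own documentation is the source of truth):

```
# Tell OpenAI's training crawler to stay away from the whole site
User-agent: GPTBot
Disallow: /

# Everyone else can carry on as normal
User-agent: *
Disallow:
```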
There are some problems with this approach though, namely that there’s no way of saying “all AI crawlers”. Instead you need to list each one you care about individually, and the default stance of “go for it” means if you miss one then it will continue to crawl content. People have tried to compile lists, but by their very nature they’re going to become out of date. Without some other standard this is just going to become an ongoing race between new models arriving with their own new crawlers (especially as cheaper compute time and roll-your-own models become more commonplace), and people updating their lists.
It gets even messier when we consider that AI is able to – in effect – conduct its own searches and ‘look’ at web pages to try to answer questions. OpenAI lists 3 crawlers (at the time of writing) which are variously used for training their model, returning search results which they think are useful, and trying to extract information from a page to answer a user query. Deciding where you’re happy drawing the line for the various capabilities of each service (assuming they distinguish them at all) is just another thing to keep track of.
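To make that concrete: at the time of writing OpenAI documents GPTBot (training), OAI-SearchBot (search results) and ChatGPT-User (fetching pages to answer a query) as separate crawlers. A sketch of drawing the line at "you may index us for search, but nothing else" would need an entry per bot:

```
# Allow OpenAI's search indexing
User-agent: OAI-SearchBot
Disallow:

# Block training on our content
User-agent: GPTBot
Disallow: /

# Block on-demand fetching to answer user queries
User-agent: ChatGPT-User
Disallow: /
```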
The other big problem with this approach is that while it can technically be used to express preferences on a per-page basis, trying to do this with a large site will rapidly become unwieldy. Trying to express preferences for training, indexing and querying multiple pages between multiple crawlers is a combinatorial explosion. This could rapidly lead to enormous files that services will simply start to ignore as being too big to process in a timely manner.
Use the <meta name="robots"> tag
On individual pages, we can use the “robots” meta tag to indicate to search engine crawlers that they shouldn’t be indexing a page, as well as slightly more complex concepts such as “you can index this, but don’t display text snippets”. When it comes to AI there isn’t an accepted standard for indicating if content can be used or not, but there is an emerging one:
<meta name="robots" content="noai,noimageai">
This originally came from DeviantArt, but it seems to have a growing acceptance among content-sharing sites as being a de-facto standard for indicating this kind of thing. Whether AI crawlers care about it, on the other hand, is very much unknown.
This also lacks the ability to express the difference between the use of content for training and the use of content for answering queries, meaning it can be a bit of a blunt instrument.
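Put side by side, the established directives and the emerging noai ones sit in the same place in a page's head (HTML allows multiple robots meta tags, and their directives combine):

```
<head>
  <!-- Established: index this page, but don't display text snippets -->
  <meta name="robots" content="nosnippet">
  <!-- Emerging, DeviantArt-style: don't use this page for AI training -->
  <meta name="robots" content="noai, noimageai">
</head>
```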
Use domain-specific options
It turns out that there’s a standard mechanism (insofar as it’s included as a .well-known file) – ostensibly for news sites – for expressing trust relationships between services and highlighting things like responsible disclosure policies. This includes a datatrainingallowed parameter which can be used for telling crawlers at a site-level that we don’t want them to be using content for training.
Unfortunately, the same as with robots.txt, there’s no mechanism here for expressing nuance. It’s an all or nothing switch between “use our whole site for training purposes” and “don’t train using any content on our site”. Domain-specific options also suffer from being narrow in scope by design, making them poorly suited for sharing your intent with crawlers operating outside of that domain.
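The switch itself is tiny. As a sketch, assuming the trust.txt proposal we came across (the exact file name and syntax come from that spec, so treat this as illustrative rather than definitive):

```
# /.well-known/trust.txt
datatrainingallowed=no
```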
What’s the conclusion?
Ultimately, at the moment, there isn’t a good way to instruct AI crawlers when it comes to them consuming content – only a set of suggestions with poor flexibility, and even poorer implementation by crawlers where it exists at all.
The most robust method with the most support seems to be the use of robots.txt, but new crawlers may appear in the future and start slurping content before they can be included in lists. As services become more complex and with more nuanced abilities robots.txt is also badly suited for expressing different policies page-by-page.
Although the use of noai directives gives us flexibility to specify exactly how we want AI to be able to interact with individual pieces of content, support isn’t widely implemented. As far as I can tell, big players like OpenAI simply ignore it, and even if they did offer support it doesn’t let us express concepts like “don’t use this for training, but you can use it to try to answer questions”.
Ultimately though, web standards are built from the ground up by people using them. We’d welcome a much more rigorous discussion around how content providers of all shapes and sizes can tell AI crawlers what they can and can’t do, but in the meantime using what we’ve got available now helps to identify gaps in what we can express and place pressure on those AI crawlers to be good internet citizens.
The post Just my type: how we use type annotations to build more robust code appeared first on dxw.
Just my type: how we use type annotations to build more robust code

Type checking is one of the tools we use to build software we’re confident will stand up to the unexpected
Imagine you have a child’s shape sorter. It’s a box with a few shaped holes on the top and a set of matching shapes. You can post each shape through its matching hole, into the box below. It doesn’t much matter about the colour of the shape, only that it fits. You can try to push the star through the square as much as you like, but it won’t work. If you do force it through, you might break something.
In programming, we have a similar concept when it comes to information, and we call this “typing”. Each bit of information we pass around in the program has a type. For example, a whole number is an “integer” and a series of characters in a row is a “string”. There are loads of types used to describe generic information, and we can build our own types as well for doing more specific things.
Much as with the shape sorter, though, if we try to push a type into the wrong-shaped hole then things are liable to break. You can’t capitalise a number, for example, or multiply a word. A good example of this crops up in handling user input. A lot of what we do takes input from web forms, and because of how the web works these always arrive as a string (of characters).
Beware the snowman
The programming languages we tend to use like to be helpful, and they might let us pretend that these strings are numbers when we try to do maths with them, especially if they look like a number. That works fine when a user inputs “1337” as we expect, but what about “banana”, or “beware the snowman”? On the other side, what happens if a user submits a phone number and we accidentally strip off the first zero because it looks like an integer?
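Python, for example, is stricter than most about mixing strings and numbers, but the underlying problem is the same everywhere. A sketch of the kind of thing that goes wrong (the phone number is a made-up UK-style example):

```python
# A web form always hands us strings, even for "number" fields
form_value = "1337"

# Treating it as a number works if we remember to convert it first
assert int(form_value) + 1 == 1338

# Forgetting the conversion silently does the wrong thing:
# "+" on two strings concatenates rather than adds
assert form_value + "1" == "13371"

# Conversion refuses nonsense outright, at runtime
try:
    int("banana")
except ValueError:
    pass  # this is what actually happens

# And converting a phone number strips its leading zero
assert str(int("01632960983")) == "1632960983"
```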
Fortunately for us there’s a way to help ensure this doesn’t happen, and it’s called type checking. At its core, type checking involves making sure that every possible path through the code can only ever pass the expected type of data. If some part of code expects a string and there’s a chance it could receive an integer then we’ll be told before the code is ever run.
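As a minimal sketch in Python (the exact error message depends on the checker), an annotation promises "ints in, ints out", and a type checker such as mypy holds the rest of the code to that promise:

```python
def add_one(value: int) -> int:
    # The annotations promise: this takes an int and returns an int
    return value + 1

# Fine: the types line up
result = add_one(1336)

# A checker run over this file (e.g. `mypy example.py`) would reject
# the next line before the program ever runs:
#     add_one("banana")  # error: argument has incompatible type "str"
```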
Some of the languages we use to build software at dxw, like C# and Kotlin, force us to do this as part of how the language is built. For others like Ruby and Python, some tooling wraps around the language to keep things in line. In JavaScript, we can (and mostly do) use a variation on the language called TypeScript and transpile it, checking types in the process.
Historically, though, Ruby, Python and JavaScript haven’t made use of type checking until the point things are actually run. This means we could ship a piece of code which was susceptible to the above “banana” bug, and we wouldn’t know until somebody tried to divide by “banana” – which Ruby’s to_i quietly turns into zero – and ended up with a division by zero error.
Building with types
So, how can we prevent this? It’s actually pretty easy, and it boils down to “use type checking where we aren’t already, and start being explicit about the types of data we’re passing around”. For Kotlin and C# this is a done deal: it’s simply not possible to write in those languages without being explicit about types and having those types checked at compile time. Similarly, for anything we write in TypeScript, the language forces us to declare types and have them checked at transpilation.
For Ruby and Python it’s a bit different, as these languages rely on having external tools like Sorbet and mypy to check that things are as expected. Typing in these languages is also optional, and we can add it gradually, in the process giving us extra confidence that things are actually as we expect. This is what some of our delivery teams working on existing code are doing at the moment – adding typing to existing code as it becomes relevant, as well as typing new code as it’s written.
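Gradual typing means annotated and unannotated code can sit side by side: we annotate the boundary we care about today and tighten the rest later. A minimal Python sketch of that mix (the function names are invented for illustration):

```python
from typing import Optional

# Newly annotated boundary: a checker like mypy enforces this signature
def parse_quantity(raw: str) -> Optional[int]:
    """Convert a form field to an int, or None if it isn't numeric."""
    try:
        return int(raw)
    except ValueError:
        return None

# Older, as-yet-unannotated code: the checker leaves it alone for now
def legacy_total(items):
    return sum(items)

assert parse_quantity("3") == 3
assert parse_quantity("banana") is None
```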
Type checking on its own doesn’t guarantee bug-free programs, but it’s another one of the tools we use along with rigorous testing, automated pipelines and code reviewing to build software we’re confident will stand up to the unexpected.